
Noun Countability

Timothy Baldwin and Dominic Widdows


Background

• Countability is a syntactic property of the noun phrase in languages such as English, Dutch, Albanian and Tagalog

• In generation, used to decide between: a cake, cake, a piece of cake

• In analysis, helps to resolve ambiguity:
  ◦ I need a paper by this evening (academic/newspaper)
  ◦ I need some paper by this evening (material)
  ◦ I need the paper by this evening (ambiguous)

31 October, 2003


Noun Phrase Countability

• Semantically motivated:
  ◦ bounded, indivisible individuals (+b): prototypically countable, e.g. a dog, two dogs
  ◦ unbounded, divisible substances (−b): prototypically uncountable, e.g. gold


Countability Classes

• countable: book, button, person (one book, two books)

• uncountable: equipment, gold, wood (*one equipment, much equipment, *two equipments)

• plural only: clothes, manners, outskirts (*one clothes, clothes horse)

• bipartite: glasses, scissors, trousers (*one scissors, scissor kick, pair of scissors)


Applications

• Determination of countability of unknown nouns (e.g. acyclovir, coagulopathy)

• Detection of countability anomalies in multiword expressions (e.g. public relations, cat’s cradle)

• Extraction of English determinerless PPs (e.g. by bus, at sea)

• Key component of noun type hierarchy in precision grammars (e.g. ERG, Alpino)


Learning the Countability of English Nouns from Corpus Data

Timothy Baldwin and Francis Bond, ACL 2003


Learning Countability

• Observation: the countability properties of a noun type are reflected in its corpus token occurrences:

... Cezanne snarling like a dog and then ...

... doing an impression of a rabid dog.

... with a pack of dogs running beside them.

Amnesty International has received information ...

Recent information from former detainees ...

... researchers often uncover information ...


Case in Point

Acyclovir is a specifically anti-viral drug ...

Acyclovir has been developed and marketed by ...

Acyclovir given intravenously, ...

Coagulopathy is a well recognised complication ...

... may explain why coagulopathy after shunting is ...

... could stimulate a coagulopathy ...

... is also probably responsible for a coagulopathy ...

... a patient with a coagulopathy.


Methodology

• Identify lexical and/or constructional features associated with each countability class

• Determine the relative corpus occurrence of the features for each noun

• Use the noun feature vectors to classify the noun as a member of each of the countability classes, training from gold-standard countability data


Feature Clusters

Head noun number (1D): target noun number as head of NP (e.g. a shaggy dog = SINGULAR)

Modifier noun number (1D): target noun number as modifier in NP (e.g. dog food = SINGULAR)

Subject–verb agreement (2D): target noun number as subject vs. verb number agreement (e.g. the dog barks = 〈SINGULAR, SINGULAR〉)

Coordinate noun number (2D): target noun number vs. the number of the head nouns of conjuncts (e.g. dogs and mud = 〈PLURAL, SINGULAR〉)

N of N constructions (2D): number of N vs. type of N (e.g. the type of dog = 〈TYPE, SINGULAR〉); total of 11 N types for use in this feature cluster (e.g. COLLECTIVE, LACK, TEMPORAL)

Occurrence in PPs (2D): the presence or absence of a determiner (±DET) in singular head complement of PP (e.g. per dog = 〈per, −DET〉)

Pronoun co-occurrence (2D): what pronouns occur in the same sentence as singular and plural instances (e.g. The dog ate its dinner = 〈its, SINGULAR〉). Approximation of pronoun co-indexation.

Singular determiners (1D): singular-selecting determiners (e.g. a dog = a). Two types: countable (e.g. another, each), uncountable (e.g. much, little).

Plural determiners (1D): plural-selecting determiners (e.g. few dogs = few)

Non-bounded determiners (2D): non-bounded determiner vs. noun number (e.g. more dogs = 〈more, PLURAL〉)


Feature Values

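1D:

\[ \mathrm{corpfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(*)} \tag{1} \]

\[ \mathrm{wordfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\mathrm{freq}(w)} \tag{2} \]

\[ \mathrm{featfreq}(f_s, w) = \frac{\mathrm{freq}(f_s \mid w)}{\sum_i \mathrm{freq}(f_i \mid w)} \tag{3} \]

2D:

\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_i \mathrm{freq}(f_{i,t} \mid w)} \tag{4} \]

\[ \mathrm{featdimfreq}(f_{s,t}, w) = \frac{\mathrm{freq}(f_{s,t} \mid w)}{\sum_j \mathrm{freq}(f_{s,j} \mid w)} \tag{5} \]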
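
The feature values in (1) to (5) can be sketched in a few lines of Python. This is only an illustration: the function names, the toy counts for "dog", and the reading of freq(∗) as a corpus-wide total are assumptions, not details taken from the slides.

```python
from collections import Counter

def one_d_values(counts, s, word_freq, corpus_freq):
    """Return (corpfreq, wordfreq, featfreq) for feature value s of one noun.

    counts:      Counter mapping feature value -> freq(f_i | w) in one cluster
    word_freq:   freq(w)
    corpus_freq: freq(*), assumed here to be a corpus-wide total
    """
    f = counts[s]
    total = sum(counts.values())
    featfreq = f / total if total else 0.0
    return (f / corpus_freq, f / word_freq, featfreq)

def two_d_values(counts, s, t):
    """Return the two featdimfreq values, eq. (4) and eq. (5), for cell (s, t).

    counts: Counter mapping (row, column) -> freq(f_{i,j} | w)
    """
    f = counts[(s, t)]
    col_total = sum(c for (i, j), c in counts.items() if j == t)  # sum over i
    row_total = sum(c for (i, j), c in counts.items() if i == s)  # sum over j
    return (f / col_total if col_total else 0.0,
            f / row_total if row_total else 0.0)

# Toy counts, invented for illustration only.
head_number = Counter({"SINGULAR": 6, "PLURAL": 2})
nonbounded = Counter({("more", "PLURAL"): 3,
                      ("more", "SINGULAR"): 1,
                      ("some", "PLURAL"): 2})

print(one_d_values(head_number, "SINGULAR", word_freq=10, corpus_freq=1000))
print(two_d_values(nonbounded, "more", "PLURAL"))
```

The two values returned by `two_d_values` differ only in the normaliser: one sums down the column for a fixed t, the other along the row for a fixed s.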


[Figures: 1-D case, illustrating corpfreq(f, w), wordfreq(f, w) and featfreq(f, w)]

[Figures: 2-D case, illustrating corpfreq, wordfreq and featfreq over individual cells (f_{s,t}, w); featdimfreq(f_{s,t}, w); and corpfreq, wordfreq and featfreq over the totals (f_{∗,t}, w) and (f_{s,∗}, w)]


Feature Value Extraction

• POS tagging and templates
  ◦ extract features with regexp-based templates

• Full text chunking
  ◦ conservative inter-chunk attachment disambiguation

• Robust parsing (RASP)

• Concatenated feature values from three systems
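
As a rough idea of what a regexp-based template over POS-tagged text might look like, here is a hypothetical sketch for the singular determiner cluster. The template, the word/TAG input format and the sample sentence are all assumptions for illustration (the tags AT0, DT0 and NN1 follow the CLAWS5 tagset used for the BNC); the actual templates are not shown in the slides.

```python
import re
from collections import Counter

# Hypothetical template: an article or determiner immediately followed by a
# singular common noun, over text in word/TAG form.
SING_DET = re.compile(r"(\w+)/(?:AT0|DT0)\s+(\w+)/NN1\b")

def singular_determiners(tagged_text, noun):
    """Count the determiners that directly precede singular `noun`."""
    counts = Counter()
    for det, head in SING_DET.findall(tagged_text):
        if head.lower() == noun:
            counts[det.lower()] += 1
    return counts

# Invented sample sentence, for illustration only.
sample = "I/PNP saw/VVD a/AT0 dog/NN1 and/CJC another/DT0 dog/NN1 ./PUN"
print(singular_determiners(sample, "dog"))
```

A template this strict misses nouns separated from their determiner by modifiers (a shaggy dog), which is one reason to combine it with chunking and parsing.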


Classifier architecture

• Training data: generated from combination of ALT-J/E and COMLEX (5,943 common nouns in BNC)
  ◦ positive examples in both ALT-J/E and COMLEX
  ◦ negative examples in neither ALT-J/E nor COMLEX

• Test data: nouns with ≥ 10 BNC instances for all 3 methods (20,530 common nouns)

• Four binary supervised classifiers, one per countability class (learned using TiMBL and k-NN)


Supervised Classifiers: Basic Overview

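[Diagram: training data (instances labelled A and B) feeds a learner, which produces a classifier; the classifier takes a test instance from the test data and outputs a classification]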


k-NN in TiMBL

• Distance between feature vectors X and Y based on “overlap metric”:

\[ \Delta(X, Y) = \sum_i \delta(x_i, y_i) \]

\[ \delta(x_i, y_i) = \begin{cases} \left| \dfrac{x_i - y_i}{\max_i - \min_i} \right| & \text{if numeric, else} \\ 0 & \text{if } x_i = y_i \\ 1 & \text{if } x_i \neq y_i \end{cases} \]

• Retrieve the neighbours at the k closest distances and classify according to the most common class amongst them
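
The overlap metric and the "k closest distances" rule above can be sketched as follows. This is a minimal stand-alone illustration, not TiMBL itself: the function names, the explicit `ranges` argument and the toy training vectors are assumptions.

```python
from collections import Counter

def delta(x, y, lo, hi):
    """Per-feature distance: scaled absolute difference for numeric values,
    otherwise 0 if the values match and 1 if they differ."""
    if isinstance(x, (int, float)) and isinstance(y, (int, float)):
        return abs(x - y) / (hi - lo) if hi != lo else 0.0
    return 0.0 if x == y else 1.0

def distance(xs, ys, ranges):
    """Sum of per-feature distances; ranges holds (min_i, max_i) per feature."""
    return sum(delta(x, y, lo, hi) for x, y, (lo, hi) in zip(xs, ys, ranges))

def classify(query, training, ranges, k=1):
    """training: list of (feature_vector, label).
    Votes over every neighbour lying at one of the k closest distances."""
    scored = [(distance(query, vec, ranges), label) for vec, label in training]
    closest = sorted({d for d, _ in scored})[:k]
    votes = Counter(label for d, label in scored if d in closest)
    return votes.most_common(1)[0][0]

# Toy feature vectors, invented for illustration only.
training = [((0.9, 0.1), "countable"),
            ((0.8, 0.3), "countable"),
            ((0.1, 0.9), "uncountable")]
ranges = [(0.0, 1.0), (0.0, 1.0)]

print(classify((0.7, 0.2), training, ranges, k=1))
print(classify((0.2, 0.8), training, ranges, k=1))
```

Note that k counts distinct distances rather than individual neighbours, so several training instances tied at the same distance all get a vote.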


Cross Validation: Input

• Take training data:


Cross Validation: Partitioning

• Split up into N equal-sized (optionally stratified) partitions P_i:

[Diagram: the training data divided into ten partitions, each labelled P]


Cross Validation: Fold 1

• For each i = 1...N, take P_i as the test data and {P_j : j ≠ i} as the training data

[Diagram: the ten partitions P, with the test partition for fold 1 marked]


Cross Validation: Fold 2

• For each i = 1...N, take P_i as the test data and {P_j : j ≠ i} as the training data

[Diagram: the ten partitions P, with the test partition for fold 2 marked]


Cross Validation: Fold 3

• For each i = 1...N, take P_i as the test data and {P_j : j ≠ i} as the training data

[Diagram: the ten partitions P, with the test partition for fold 3 marked]


Cross Validation: Fold i

• And so on ...


Cross Validation: Evaluate

• Calculate classification accuracy, precision, recall, F-score, ... according to the average across the N iterations

• Effective method of minimising training bias and test variance
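
The cross-validation procedure on the preceding slides can be sketched as below. It is a simplified illustration under stated assumptions: the partitions are not stratified, only accuracy is averaged, and the `majority_baseline` learner and toy data are stand-ins invented here rather than the classifiers used in the experiments.

```python
import random

def cross_validate(data, n_folds, train_fn, seed=0):
    """data: list of (features, label) pairs.
    train_fn(train) must return a predict(features) function.
    Returns the mean accuracy over the n_folds test partitions."""
    items = data[:]
    random.Random(seed).shuffle(items)
    # Partition P_i takes every n_folds-th item, so sizes differ by at most 1.
    folds = [items[i::n_folds] for i in range(n_folds)]
    accuracies = []
    for i, test in enumerate(folds):
        # Training data is every partition except P_i.
        train = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        predict = train_fn(train)
        correct = sum(predict(x) == y for x, y in test)
        accuracies.append(correct / len(test))
    return sum(accuracies) / n_folds

def majority_baseline(train):
    """Stand-in learner: always predict the most frequent training label."""
    labels = [y for _, y in train]
    top = max(set(labels), key=labels.count)
    return lambda x: top

# Toy data, invented for illustration: 8 countable and 2 uncountable nouns.
data = [((i,), "countable") for i in range(8)]
data += [((i,), "uncountable") for i in range(8, 10)]
print(cross_validate(data, n_folds=5, train_fn=majority_baseline))
```

Because every instance is used as test data exactly once, the averaged score does not depend on one lucky or unlucky split.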


Cross-validated Countability Results

• Good results (particularly for countable and uncountable nouns), well above the baseline accuracy in each case

• Best results for combined method (concatenation of three pre-processors)


Manual Evaluation over Open Data

• Classifier precision of 94.6% relative to lexicons

• Manually annotated 100 nouns from the test data:
  ◦ Agreement between classified and hand-annotated countabilities 92.4%
  ◦ Agreement between classified and dictionary countabilities 92.4%

• Classifiers agree with corpus as well as lexicons


Reflections

• Impressive results, but still room for improvement (particularly for the less-populated countability classes)

• Boundary between motivated countabilities and conversions (e.g. chicken vs. elephant vs. dog)

• Difficulties caused by MWEs (e.g. cat’s cradle)

• Sense and frequency effects (e.g. information)


Using an ontology to determine English countability

Francis Bond and Caitlin Vatikiotis-Bateson, COLING 2002


Semantic Predictability of Countability

• How far is English countability predictable from meaning?

• Countability is to some degree deterministic given the semantics of a word:

dog, pooch, canine, mongrel, ...

BUT suitcases vs. luggage, leaves vs. foliage, etc.


Case in Point

Coagulopathy: group of conditions of the blood clotting (coagulation) system in which bleeding is prolonged and excessive, a bleeding disorder

Acyclovir: antiviral drug


Word Denotation and Countability

• Knowing the referent is not enough, e.g. scales

1. Thought of as being made of two arms: (British)

a pair of scales

2. Thought of as a set of numbers: (Australian)

a set of scales

3. Thought of as discrete whole objects: (American)

one scale/two scales


Methodology

• Take an existing ontology and determine the default countability for each synset (semantic class)

• Test how reliably defaults predict the countability of members of each synset

• Base experimentation on the ALT-J/E semantic transfer lexicon and ontology


Lexicon

• ALT-J/E’s semantic transfer lexicon

• 71,833 linked Japanese-English noun pairs


The Goi-Taikei Ontology

• A rich ontology and wide coverage of Japanese

• Used in many NLP applications such as MT

• 2,710 semantic classes (12-level tree structure) for common nouns

• Constructed from translation pairs (without countability in mind)


Top Four Levels of Ontology


Noun Countability Preferences in ALT-J/E

Noun Countability Preference   Code   Example     Default Number   Default Classifier   #        %
fully countable                CO     knife       sg               —                    47,255   65.8
strongly countable             BC     cake        sg               —                    3,110    4.3
weakly countable               BU     beer        sg               —                    3,377    4.7
uncountable                    UC     furniture   sg               piece                15,435   21.5
plural only                    PT     scissors    pl               pair                 2,107    2.9


Experiment

• Treat every combination of semantic classes as a different semantic class.

• Most frequent NCP is assigned to all members of a class.
  ◦ Ties are resolved as follows: fully countable beats strongly countable beats weakly countable beats uncountable beats plural only.

• Baseline (all fully countable = 65.8%)


Example

• Semantic Class 910:tableware
  ◦ crockery ⇔ toukirui (UC)
  ◦ dinner set ⇔ youshokki (CO)
  ◦ tableware ⇔ shokki (UC)
  ◦ Western-style tableware ⇔ youshokki (UC)

• The most common NCP is UC, so uncountable is associated with 910:tableware.

• This predicts the NCP correctly 75% of the time.
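
The tableware example can be worked through in code. The sketch below assigns the most frequent NCP to the class, breaks ties in the order given on the Experiment slide, and checks the 75% figure; the function and variable names are illustrative assumptions.

```python
from collections import Counter

# Tie-break order from the slides: earlier codes beat later ones
# (fully > strongly > weakly countable > uncountable > plural only).
PRIORITY = ["CO", "BC", "BU", "UC", "PT"]

def default_ncp(codes):
    """Most frequent NCP code in a class, with ties resolved by PRIORITY."""
    counts = Counter(codes)
    return min(counts, key=lambda c: (-counts[c], PRIORITY.index(c)))

# Members of semantic class 910:tableware and their NCP codes.
tableware = {
    "crockery": "UC",
    "dinner set": "CO",
    "tableware": "UC",
    "Western-style tableware": "UC",
}

default = default_ncp(tableware.values())
accuracy = sum(code == default for code in tableware.values()) / len(tableware)
print(default, accuracy)  # UC 0.75
```

Three of the four entries carry the class default, which is where the 75% comes from; only dinner set (CO) is mispredicted.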


Results

Conditions                 %      Range       Baseline
Training=Test              77.9   76.8–78.6   65.8
10-fold Cross Validation   71.2   69.8–72.1   65.8

• 11.6% given default value (fully countable)


Discussion

• Semantics predicts countability around 78% of the time: supports hypothesis that countability is semantically motivated

• Less successful than corpus-based countability learning

• Problems of granularity/translation-orientation of lexicon

• Problems with noise in lexicon


The Ins and Outs of Dutch Noun Countability Classification

Timothy Baldwin and Leonoor van der Beek, ALTW 2003


Crosslinguistic Predictability of Countability

• In linguistically-related languages such as English and Dutch, countability generally patterns the same way:
  ◦ same basic behaviour of translation-equivalent lexical/syntactic markers of countability (e.g. one dog ⇌ een hond, some rice ⇌ een beetje rijst)
  ◦ translation pairs often have same countability: car ⇌ auto [countable], food ⇌ eten [uncountable], BUT thunderstorm [countable] vs. onweer [uncountable]


Out-of- vs. In-language Classification

• Given high-quality training data in a closely-related language (English – COMLEX + ALT-J/E) and medium-quality data in the target language (Dutch – Alpino lexicon):
  ◦ which generates the best classifier?
  ◦ what is the best form of crosslingual mapping?

• Focus on the task of Dutch noun countability classification


Approaches to Monolingual Classification

• Evidence-based classification: base classification on token evidence for each countability class

• Distribution-based classification: same as for EN-EN classification task (Baldwin and Bond (2003))


Approaches to Crosslingual Classification

• Corpus occurrence-based classification (binary vs. multiclass):
  ◦ cluster-to-cluster classification: EN and ND feature clusters pattern the same
  ◦ feature-to-feature classification: EN and ND features pattern the same (all features vs. partitions of feature space)


• Translation-based classification: countability is preserved under translation (e.g. car ⇌ auto [countable])

• Transliteration-based classification: countability is preserved under transliteration (e.g. paranoia ⇌ paranoia [uncountable])

• System combination: classify according to combined outputs of individual methods
  ◦ crosslingual + unsupervised monolingual
  ◦ crosslingual + monolingual


Results

• Better results for crosslingual than monolingual classification (!)

• Classifiers produce countability results more consistent with corpus occurrence than Alpino lexicon

• Translation and transliteration are excellent predictors of countability

• Semantics in crosslingual countability classification?


A Preview of Results from EuroWordNet

                       Dutch Alpino       English dict+learned
                       count   uncount    count   uncount
Dutch Alpino           0.75    0.37       0.87    0.47
Dutch annotated        0.76    0.44       0.90    0.75
English annotated      0.64    0.49       0.58    0.47
English dict+learned   0.63    0.33       0.62    0.47


Final Reflections

• Demonstration of types of methods that can be used to determine noun type countability
  ◦ distribution-based
  ◦ semantics/sense-based
  ◦ translation/transliteration-based

• Where next? Watch this space!


Acknowledgements

• Thanks to Francis Bond and Caitlin Vatikiotis-Bateson for sharing their wonderful slides and graphics!
