
An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model

Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro

[email protected]

Department of Computer Science - SWAP Research Group

University of Bari Aldo Moro (ITALY)

Coling 2014, Dublin, 27th-29th August 2014

A. Caputo ([email protected]) Lesk-DSM Coling 2014 - 28 Aug. 2014 1 / 21


Motivations Problem

One word... many meanings

BANK

1 Sloping land (especially the slope beside a body of water)

2 A financial institution that accepts deposits and channels the money into lending activities

3 A long ridge or pile

4 ...


Motivations Lesk WSD

Simple Lesk approach

Insight

Select the meaning whose gloss maximizes the context overlap

Example

The bank keeps my money

1 Sloping land (especially the slope beside a body of water) ⇒ overlap=0

2 A financial institution that accepts deposits and channels the money into lending activities ⇒ overlap=1
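The overlap count in this example can be reproduced with a minimal sketch. It assumes a tiny hand-written stopword list and plain lowercase tokenization, neither of which comes from the slides; `simple_lesk`, `tokens` and `STOPWORDS` are hypothetical names used only for illustration.

```python
import re

# Hypothetical minimal stopword list (an assumption; a real system would use a fuller one).
STOPWORDS = {"a", "an", "and", "at", "he", "into", "my", "of", "that", "the"}


def tokens(text):
    """Lowercase the text, split on non-letters and drop stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def simple_lesk(context, target, glosses):
    """Return (best_sense, overlaps); overlaps maps each sense to its shared-word count."""
    ctx = tokens(context) - {target}
    overlaps = {sense: len(ctx & tokens(gloss)) for sense, gloss in glosses.items()}
    return max(overlaps, key=overlaps.get), overlaps


GLOSSES = {
    1: "Sloping land (especially the slope beside a body of water)",
    2: "A financial institution that accepts deposits and channels "
       "the money into lending activities",
}

best, overlaps = simple_lesk("The bank keeps my money", "bank", GLOSSES)
print(best, overlaps)
```

Because the match is on exact strings, only "money" is shared with the second gloss, which is why that sense wins with an overlap of 1.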


Motivations Lesk WSD

Simple Lesk approach

Issues

1 Sense definition is short ⇒ Reduced chances of matching

2 Overlap based on string matching ⇒ Semantically related words are treated as different

3 No knowledge about sense usage

Lesk mismatch

Sentence to disambiguate

he cashed a check at the bank

Right sense definition

A financial institution that accepts deposits and channels the money into lending activities

overlap=0


Solution Distributional Lesk

Idea

Solutions

1 Sense definition is short ⇒ Gloss expansion through related meanings

2 Overlap is based on string matching ⇒ Similarity computed in a WordSpace

3 No knowledge about sense usage ⇒ Exploiting a sense-annotated corpus


Solution Distributional Lesk

Idea

Solutions

1 Sense definition is short ⇒ Gloss expansion through related meanings

Gloss Expansion

Sentence to disambiguate

he cashed a check at the bank

A financial institution that accepts deposits and channels the money into lending activities

+

A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock...

overlap=1


Solution Distributional Lesk

Idea

Solutions

2 Overlap is based on string matching ⇒ Similarity computed in a WordSpace

Gloss Expansion

Sentence to disambiguate

that bank holds the mortgage on my home

A financial institution that accepts deposits and channels the money into lending activities

+

A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock...

overlap=0


[Figure: WordSpace illustration for the same sentence and expanded gloss. Word labels: sloping, side, ground, mortgage, loans, lending, deposit, money, financial, cash, payment]


Solution Gloss expansion

Gloss expansion

Leavening on a semantic network

• Concatenate recursively glosses of related synsets until a depth d is reached

• Exclude “antonym” relation
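The two bullets above can be sketched on a toy semantic network. Everything in the data below is hypothetical (the synset ids `s0` to `s3`, their glosses and the relation names are illustrative, not WordNet or BabelNet content), and the traversal is written as an iterative breadth-first walk rather than literal recursion; it visits the same synsets up to depth d and records each one's distance for later weighting.

```python
# Toy semantic network (hypothetical data, for illustration only).
GLOSS = {
    "s0": "financial institution that accepts deposits",
    "s1": "financial institution that makes loans",
    "s2": "temporary provision of money",
    "s3": "sloping land",
}
# Edges: synset -> list of (relation, related synset).
RELATED = {
    "s0": [("hypernym", "s1"), ("antonym", "s3")],
    "s1": [("related", "s2")],
}


def expand_gloss(synset, depth, gloss=GLOSS, related=RELATED, excluded=("antonym",)):
    """Return [(gloss, distance)] for the synset and every synset within `depth` edges.

    Edges whose relation is in `excluded` are never followed.
    """
    seen = {synset: 0}          # synset -> distance from the target synset
    frontier = [synset]
    for dist in range(1, depth + 1):
        next_frontier = []
        for current in frontier:
            for relation, other in related.get(current, []):
                if relation in excluded or other in seen:
                    continue
                seen[other] = dist
                next_frontier.append(other)
        frontier = next_frontier
    return [(gloss[s], d) for s, d in seen.items()]


for text, distance in expand_gloss("s0", depth=2):
    print(distance, text)
```

Keeping the distance next to each gloss is a design choice of this sketch: it lets a later step down-weight words that come from more distant synsets.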


Solution Gloss expansion

Term weighting

Idea

Term relevance depends on both its frequency and the distance d of the related synset

Solutions

Inverse gloss frequency (IGF): Words occurring in all the extended glosses associated with the target word poorly characterize the meaning description

Distance weight: Inversely proportional to the distance in the network (number of edges) between the target synset and the related synset

Bank

1 Sloping land (especially the slope beside a body of water)

2 A financial institution that accepts deposits and channels the money into lending activities

...

8 A container (usually with a slot in the top) for keeping money at home


Solution Gloss expansion

Term weight

Inverse gloss frequency

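$$IGF_k = 1 + \log_2 \frac{|S_i|}{gf^*_k} \qquad (1)$$

$gf^*_k$ is the number of extended glosses that contain the word $w_k$

Term weight

The weight for a word $w_k$ appearing $h$ times in the extended gloss $g^*_{ij}$ is given by

$$weight(w_k, g^*_{ij}) = \sum_{h} \frac{1}{1 + d} \times IGF_k \qquad (2)$$

The sum runs over the $h$ occurrences of $w_k$; $d$ is the distance of the related synset that supplies the occurrence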
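Equations (1) and (2) can be checked numerically with a small sketch. It assumes that |S_i| is the number of senses of the target word (one extended gloss per sense) and that each word occurrence carries the distance d of the synset it came from; the function name `term_weights` and the toy input are hypothetical.

```python
import math
from collections import defaultdict


def term_weights(extended_glosses):
    """Compute one {word: weight} dict per sense.

    extended_glosses: one list of (word, distance) pairs per sense of the target word.
    """
    n_senses = len(extended_glosses)                # |S_i| (assumed: number of senses)
    gloss_freq = defaultdict(int)                   # gf*_k: extended glosses containing w_k
    for gloss in extended_glosses:
        for word in {w for w, _ in gloss}:
            gloss_freq[word] += 1

    result = []
    for gloss in extended_glosses:
        weights = defaultdict(float)
        for word, distance in gloss:
            igf = 1 + math.log2(n_senses / gloss_freq[word])    # equation (1)
            weights[word] += igf / (1 + distance)               # one term of equation (2)
        result.append(dict(weights))
    return result


# Toy input: two senses, words paired with the distance of their source synset.
sense_a = [("money", 0), ("money", 1), ("institution", 0)]
sense_b = [("land", 0), ("institution", 1)]
weights_a, weights_b = term_weights([sense_a, sense_b])
print(weights_a)
print(weights_b)
```

In this toy input "money" occurs in only one of the two extended glosses, so its IGF is 2, while "institution" occurs in both and gets the minimum IGF of 1.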


Solution WordSpace

Distributional Semantic Models (DSMs)

• You shall know a word by the company it keeps!

• Words are represented as points in a geometric space

• Words are related if they are close in that space


Solution WordSpace

Overlap in DSM

• Gloss as a vector: weighted vector sum of terms occurring in the expanded gloss

• Context as a vector: vector sum of the target surrounding words

• Compute the overlap as the cosine similarity between gloss vector and context vector

bank hold mortgage home

financial institution accept deposit channel money lend activity...

sloping land especially slope beside body water...
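A minimal sketch of the overlap computation follows. The 3-dimensional word vectors are invented for illustration (real ones would come from a trained DSM), the term weights are arbitrary, and the names `VECTORS`, `weighted_sum` and `cosine` are hypothetical.

```python
import math

# Hypothetical 3-dimensional word vectors (made-up values, not from a trained model).
VECTORS = {
    "mortgage": [0.9, 0.1, 0.0],
    "home":     [0.6, 0.2, 0.2],
    "money":    [0.8, 0.0, 0.1],
    "lend":     [0.9, 0.1, 0.1],
    "land":     [0.0, 0.9, 0.3],
    "water":    [0.1, 0.7, 0.6],
}


def weighted_sum(weights, vectors=VECTORS):
    """Sum the vectors of the given words, each scaled by its weight; unknown words are skipped."""
    dim = len(next(iter(vectors.values())))
    total = [0.0] * dim
    for word, weight in weights.items():
        vec = vectors.get(word)
        if vec is None:
            continue
        for i, value in enumerate(vec):
            total[i] += weight * value
    return total


def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


context = weighted_sum({"mortgage": 1.0, "home": 1.0})       # unweighted context words
financial_gloss = weighted_sum({"money": 2.0, "lend": 1.0})  # weighted gloss terms
river_gloss = weighted_sum({"land": 2.0, "water": 1.0})

print(cosine(context, financial_gloss), cosine(context, river_gloss))
```

With these toy vectors the context is closer to the financial gloss even though the two share no word, which is the point of replacing string matching with similarity in a WordSpace.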


Solution Sense Distribution

Sense distribution

Insight

Analyze the distribution of meanings according to each word

Solution

$$p(s_{ij} \mid w_i) = \frac{t(w_i, s_{ij}) + 1}{\#w_i + |S_i|} \qquad (3)$$

$t(w_i, s_{ij})$: number of times the word $w_i$ is tagged with $s_{ij}$

$\#w_i$: number of occurrences of $w_i$
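Equation (3) is an add-one smoothed estimate, so senses never seen in the annotated corpus still receive a non-zero probability. A minimal sketch, with hypothetical counts and a hypothetical function name `sense_probability`:

```python
def sense_probability(tag_counts, senses, n_occurrences):
    """Smoothed p(sense | word) following equation (3).

    tag_counts: {sense: times the word is tagged with that sense}
    senses: all senses of the word (|S_i| = len(senses))
    n_occurrences: number of occurrences of the word (#w_i)
    """
    denominator = n_occurrences + len(senses)
    return {s: (tag_counts.get(s, 0) + 1) / denominator for s in senses}


# Hypothetical counts: "bank" occurs 3 times, always tagged with the financial sense.
probs = sense_probability({"bank#2": 3}, ["bank#1", "bank#2"], n_occurrences=3)
print(probs)
```

When every occurrence of the word is tagged with exactly one sense, the smoothed values sum to 1 over the senses of the word.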


Solution Methodology

Shaking the ingredients

1 For each word retrieve the list of meanings

2 Expand the glosses and build for each expanded gloss the corresponding vector

3 Create the context vector considering surrounding words

4 Compute the overlap in DSM

5 Combine the overlap with sense distribution

6 Select the meaning whose extended gloss has the maximum overlap
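Steps 5 and 6 can be sketched as follows. The slides say only that the two scores are combined linearly; the exact form used here, `factor * similarity + (1 - factor) * prior`, is an assumption, as are the function name `disambiguate` and the toy scores.

```python
def disambiguate(similarity, prior, factor=0.5):
    """Pick the sense with the highest combined score.

    similarity: {sense: cosine similarity between context and extended gloss}
    prior: {sense: probability from the sense distribution}
    factor: weight of the similarity term (assumed form of the linear combination)
    """
    scores = {
        sense: factor * similarity[sense] + (1 - factor) * prior.get(sense, 0.0)
        for sense in similarity
    }
    return max(scores, key=scores.get), scores


# Toy scores (hypothetical values).
similarity = {"s1": 0.30, "s2": 0.25}
prior = {"s1": 0.2, "s2": 0.8}

best, scores = disambiguate(similarity, prior)            # uses both sources of evidence
best_no_prior, _ = disambiguate(similarity, prior, 1.0)   # similarity only
print(best, best_no_prior)
```

In this toy case the sense distribution overturns a small similarity advantage, which shows how the combination can change the selected meaning.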


Evaluation Goal

Evaluation

Goals

Compare our system against

1 Simplified Lesk approach

2 Other task participants

Evaluate the system with and without sense distribution

• Sense distribution linearly combined with the cosine similarity score

Dataset

• Dataset: Task-12 of SemEval-2013 Multilingual Word Sense Disambiguation

• Sense inventory: BabelNet

• Metrics: F-measure


Evaluation System setup

System setup

• Developed in Java, relying on BabelNet API 1.1.1

• Lucene analyzer to tokenize both glosses and the context, Snowball library for stemming

• Latent Semantic Analysis for building the DSM, considering the 100,000 most frequent words

  • BNC corpus for English

  • Wikipedia dump for Italian

• Synset distance d is set to 1

• Several context sizes: 3, 5, 10, 20 and the whole text

• Combination factor for cosine similarity and sense distribution: 0.5


Evaluation Results

English

Run          Context size   Sense distr.   F
MFS          -              -              0.656
EN.LESK.1    3              N              0.525
EN.LESK.6    3              Y              0.633
EN.DSM.1     3              N              0.536
EN.DSM.2     5              N              0.605
EN.DSM.3     10             N              0.633
EN.DSM.4     20             N              0.650
EN.DSM.5     W              N              0.687
EN.DSM.6     3              Y              0.669
EN.DSM.7     5              Y              0.677
EN.DSM.8     10             Y              0.689
EN.DSM.9     20             Y              0.696
EN.DSM.10    W              Y              0.715

W = whole text


Evaluation Results

Italian

Run          Context size   Sense distr.   F
MFS          -              -              0.572
IT.LESK.2    5              N              0.530
IT.LESK.10   W              Y              0.607
IT.DSM.1     3              N              0.610
IT.DSM.2     5              N              0.607
IT.DSM.3     10             N              0.626
IT.DSM.4     20             N              0.628
IT.DSM.5     W              N              0.633
IT.DSM.6     3              Y              0.631
IT.DSM.7     5              Y              0.630
IT.DSM.8     10             Y              0.635
IT.DSM.9     20             Y              0.639
IT.DSM.10    W              Y              0.641

W = whole text


Evaluation Task results

English

System        F
EN.DSM.10     0.715
EN.DSM.5      0.687
UMCC-DLSI-2   0.685
UMCC-DLSI-3   0.680
UMCC-DLSI-1   0.677
MFS           0.656
DAEBAK        0.604
GETALP-BN-1   0.263
GETALP-BN-2   0.266


Evaluation Task results

Italian

System        F
UMCC-DLSI-2   0.658
UMCC-DLSI-1   0.657
IT.DSM.10     0.641
IT.DSM.5      0.633
DAEBAK        0.613
MFS           0.572
GETALP-BN-2   0.325
GETALP-BN-1   0.324


Conclusions and Future Work Conclusions

Recap

• The proposed algorithm outperforms the simple Lesk one for both English and Italian

• With sense distribution, the system outperforms the MFS baseline at every context size in both languages; without it, it does so at every context size for Italian and only with the whole text as context for English

• For English the system obtained the best results in the SemEval-2013 Task 12 with or without sense distribution


Conclusions and Future Work Future work

What’s next?

• Extend the evaluation to other languages

• Evaluate different DSMs and compositional approaches

• Adapt our approach to a specific domain

  • Use a domain corpus for DSM building

  • Exploit a domain sense-annotated corpus for sense distribution


That’s all folks!

The system is available online: https://github.com/pippokill/lesk-wsd-dsm
