An Enhanced Lesk Word Sense Disambiguation Algorithm through a Distributional Semantic Model
Pierpaolo Basile, Annalina Caputo and Giovanni Semeraro
[email protected]
Department of Computer Science - SWAP Research Group
University of Bari Aldo Moro (ITALY)
Coling 2014, Dublin, 27th-29th August 2014
A. Caputo ([email protected]) Lesk-DSM Coling 2014 - 28 Aug. 2014 1 / 21
Motivations Problem
One word... many meanings
BANK
1 Sloping land (especially the slope beside a body of water)
2 A financial institution that accepts deposits and channels the money into lending activities
3 A long ridge or pile
4 ...
Motivations Lesk WSD
Simple Lesk approach
Insight
Select the meaning whose gloss maximizes the context overlap
Example
The bank keeps my money
1 Sloping land (especially the slope beside a body of water) ⇒ overlap=0
2 A financial institution that accepts deposits and channels the money into lending activities ⇒ overlap=1
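The overlap counts in this example can be reproduced with a minimal sketch of the simple Lesk idea. The tokenizer and the tiny stop-word list below are assumptions made only for illustration (the slides do not say how function words are handled); without stop-word removal, "the" would also count as a match.

```python
import re

# Hypothetical stop-word list, just large enough for the slide examples.
STOPWORDS = {"a", "an", "and", "at", "he", "into", "my", "of", "that", "the"}


def content_words(text):
    """Lower-case the text, keep alphabetic tokens, drop stop words."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def lesk_overlap(context, gloss):
    """Number of distinct content words shared by the context and a gloss."""
    return len(content_words(context) & content_words(gloss))


def simple_lesk(context, glosses):
    """Index of the gloss with the largest overlap (first one wins on ties)."""
    return max(range(len(glosses)), key=lambda i: lesk_overlap(context, glosses[i]))


GLOSSES = [
    "Sloping land (especially the slope beside a body of water)",
    "A financial institution that accepts deposits and channels the money "
    "into lending activities",
]

if __name__ == "__main__":
    sentence = "The bank keeps my money"
    for number, gloss in enumerate(GLOSSES, start=1):
        print(number, lesk_overlap(sentence, gloss))
    print("selected sense:", simple_lesk(sentence, GLOSSES) + 1)
```

With these assumptions the only shared content word is "money", so the second gloss is selected.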
Motivations Lesk WSD
Simple Lesk approach
Issues
1 Sense definition is short ⇒ Reduced chances of matching
2 Overlap based on string matching ⇒ Semantically related words are considered different
3 No knowledge about sense usage
Lesk mismatch
Sentence to disambiguate
he cashed a check at the bank
Right sense definition
A financial institution that accepts deposits
and channels the money into lending activities
overlap=0
Solution Distributional Lesk
Idea
Solutions
1 Sense definition is short ⇒ Gloss expansion through related meanings
2 Overlap is based on string matching ⇒ Similarity computed in a WordSpace
3 No knowledge about sense usage ⇒ Exploiting a sense-annotated corpus
Gloss Expansion
Sentence to disambiguate
he cashed a check at the bank
A financial institution that accepts deposits and channels the money into lending activities
+
A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock...
overlap=1
Sentence to disambiguate
that bank holds the mortgage on my home
A financial institution that accepts deposits and channels the money into lending activities
+
A financial institution that accepts demand deposits and makes loans and provides other services for the public... One of 12 regional banks that monitor and act as depositories for banks in their region... A corporation gaining financial control over another corporation or financial institution through a payment in cash or an exchange of stock...
overlap=0
[Figure: WordSpace sketch for the sentence "that bank holds the mortgage on my home" and the expanded gloss, showing the words sloping, side, ground, mortgage, loans, lending, deposit, money, financial, cash and payment as points in the space]
Solution Gloss expansion
Gloss expansion
Leveraging a semantic network
• Recursively concatenate the glosses of related synsets until a depth d is reached
• Exclude the “antonym” relation
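The two rules above can be sketched as a bounded traversal of a semantic network. Everything in the toy network below (synset identifiers, relation names, glosses) is made up for illustration and is not the BabelNet API; the traversal is written iteratively (breadth-first) so that each gloss is stored with its shortest distance from the target synset.

```python
from collections import deque


def expand_gloss(network, synset, max_depth):
    """Collect (gloss, distance) pairs for a synset and its related synsets.

    Related synsets are followed up to `max_depth` edges away from the
    target; edges labelled "antonym" are never followed.
    """
    seen = {synset}
    queue = deque([(synset, 0)])
    collected = []
    while queue:
        current, distance = queue.popleft()
        collected.append((network[current]["gloss"], distance))
        if distance == max_depth:
            continue  # do not go deeper than the requested depth
        for relation, target in network[current]["relations"]:
            if relation != "antonym" and target not in seen:
                seen.add(target)
                queue.append((target, distance + 1))
    return collected


# Hypothetical toy network: identifiers, relations and glosses are invented.
TOY_NETWORK = {
    "bank#2": {
        "gloss": "a financial institution that accepts deposits",
        "relations": [("hyponym", "commercial_bank#1"), ("antonym", "opposite#1")],
    },
    "commercial_bank#1": {
        "gloss": "a financial institution that accepts demand deposits and makes loans",
        "relations": [("hypernym", "bank#2"), ("related", "loan#1")],
    },
    "loan#1": {"gloss": "the temporary provision of money", "relations": []},
    "opposite#1": {"gloss": "placeholder gloss", "relations": []},
}

if __name__ == "__main__":
    for gloss, distance in expand_gloss(TOY_NETWORK, "bank#2", max_depth=1):
        print(distance, gloss)
```

Keeping the distance next to each gloss is a design choice of this sketch: it makes the distance available later, when terms are weighted.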
Solution Gloss expansion
Term weighting
Idea
Term relevance depends on both its frequency and the distance d of the related synset
Solutions
Inverse gloss frequency (IGF): words occurring in all the extended glosses associated with the target word poorly characterize the meaning description
Distance weight: inversely proportional to the distance in the network (number of edges) between the target synset and the related synset
Bank
1 Sloping land (especially the slope beside a body of water)
2 A financial institution that accepts deposits and channels the money into lending activities
...
8 A container (usually with a slot in the top) for keeping money at home
Solution Gloss expansion
Term weight
Inverse gloss frequency
IGF_k = 1 + \log_2 \frac{|S_i|}{gf^*_k}    (1)
gf^*_k is the number of extended glosses that contain the word w_k
Term weight
The weight for a word w_k appearing h times in the extended gloss g^*_{ij} is given by
weight(w_k, g^*_{ij}) = \sum_{h} \frac{1}{1 + d} \times IGF_k    (2)
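A small worked sketch of equations (1) and (2) follows. It assumes that |S_i| is the number of senses (extended glosses) of the target word and that each of the h occurrences of a word carries the distance d of the synset whose gloss it came from; both readings are assumptions about the slide, and the toy glosses are invented.

```python
import math
from collections import defaultdict


def inverse_gloss_frequency(extended_glosses):
    """IGF per word, equation (1).

    `extended_glosses` holds one list of (word, distance) pairs per sense
    of the target word, so its length plays the role of |S_i|.
    """
    n_senses = len(extended_glosses)
    gloss_freq = defaultdict(int)
    for gloss in extended_glosses:
        for word in {w for w, _ in gloss}:  # count each gloss at most once
            gloss_freq[word] += 1
    return {w: 1 + math.log2(n_senses / gf) for w, gf in gloss_freq.items()}


def term_weights(gloss, igf):
    """Weight per word in one extended gloss, equation (2)."""
    weights = defaultdict(float)
    for word, distance in gloss:  # one term of the sum per occurrence
        weights[word] += igf[word] / (1 + distance)
    return dict(weights)


# Invented extended glosses for a word with two senses.
SENSE_A = [("money", 0), ("money", 1), ("land", 1)]
SENSE_B = [("land", 0), ("water", 0)]

if __name__ == "__main__":
    igf = inverse_gloss_frequency([SENSE_A, SENSE_B])
    print(igf)
    print(term_weights(SENSE_A, igf))
```

In this toy case "land" appears in both extended glosses, so its IGF is 1, while "money" and "water" appear in one gloss each and get an IGF of 2.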
Solution WordSpace
Distributional Semantic Models (DSMs)
• You shall know a word by the company it keeps!
• Words are represented as points in a geometric space
• Words are related if they are close in that space
Solution WordSpace
Overlap in DSM
• Gloss as a vector: weighted vector sum of terms occurring in the expanded gloss
• Context as a vector: vector sum of the target surrounding words
• Compute the overlap as the cosine similarity between gloss vector and context vector
bank hold mortgage home
financial institution accept deposit channel money lend activity...
sloping land especially slope beside body water...
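The three bullets above can be sketched with plain Python lists. The two-dimensional word vectors are invented for illustration; a real WordSpace would have many more dimensions and would be built from a corpus. Unit weights are used for both vectors here, whereas the gloss vector in the slides uses the term weights.

```python
import math


def weighted_sum(word_space, weights):
    """Sum the vectors of the given words, each scaled by its weight.

    Words missing from the word space are skipped.
    """
    dim = len(next(iter(word_space.values())))
    total = [0.0] * dim
    for word, weight in weights.items():
        if word in word_space:
            for i, value in enumerate(word_space[word]):
                total[i] += weight * value
    return total


def cosine(u, v):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


# Hypothetical 2-D word space: the coordinates are made up.
WORD_SPACE = {
    "hold": [0.5, 0.5],
    "mortgage": [0.9, 0.1],
    "home": [0.6, 0.4],
    "financial": [1.0, 0.0],
    "money": [0.8, 0.2],
    "land": [0.0, 1.0],
    "water": [0.1, 0.9],
}

CONTEXT = weighted_sum(WORD_SPACE, {"hold": 1.0, "mortgage": 1.0, "home": 1.0})
FINANCE_GLOSS = weighted_sum(WORD_SPACE, {"financial": 1.0, "money": 1.0})
RIVER_GLOSS = weighted_sum(WORD_SPACE, {"land": 1.0, "water": 1.0})

if __name__ == "__main__":
    print("finance:", cosine(CONTEXT, FINANCE_GLOSS))
    print("river:  ", cosine(CONTEXT, RIVER_GLOSS))
```

Although the context shares no string with either gloss, the made-up vectors place "mortgage" close to "financial" and "money", so the financial gloss gets the higher similarity.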
Solution Sense Distribution
Sense distribution
Insight
Analyze the distribution of meanings for each word
Solution
p(s_{ij} | w_i) = \frac{t(w_i, s_{ij}) + 1}{\#w_i + |S_i|}    (3)
t(w_i, s_{ij}): number of times the word w_i is tagged with s_{ij}
\#w_i: number of occurrences of w_i
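Equation (3) is an add-one smoothed estimate, which the following sketch computes. It assumes that |S_i| is the number of senses of w_i; the counts are invented, and the probabilities sum to one only under the further assumption that every occurrence of the word in the annotated corpus carries exactly one sense tag.

```python
def sense_probability(tag_count, word_count, n_senses):
    """Add-one smoothed probability of a sense given the word, equation (3).

    tag_count  -- t(w_i, s_ij), times the word is tagged with this sense
    word_count -- #w_i, occurrences of the word in the annotated corpus
    n_senses   -- |S_i|, assumed to be the number of senses of the word
    """
    return (tag_count + 1) / (word_count + n_senses)


if __name__ == "__main__":
    # Invented counts: the word occurs 10 times and has three senses.
    tag_counts = [7, 3, 0]
    probabilities = [sense_probability(t, 10, 3) for t in tag_counts]
    print(probabilities)
```

The sense that was never observed still receives a small non-zero probability, which keeps it selectable when the context strongly supports it.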
Solution Methodology
Shaking the ingredients
1 For each word retrieve the list of meanings
2 Expand the glosses and build for each expanded gloss the corresponding vector
3 Create the context vector considering surrounding words
4 Compute the overlap in DSM
5 Combine the overlap with sense distribution
6 Select the meaning whose extended gloss has the maximum overlap
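Steps 5 and 6 can be sketched as a scoring function followed by an argmax. The slides only say that the two scores are combined linearly, so the exact form used here, factor × similarity + (1 − factor) × sense probability, is an assumption, as are the function name and the example numbers.

```python
def choose_sense(similarities, sense_probabilities, factor=0.5):
    """Combine per-sense scores linearly and return (best index, scores).

    similarities        -- overlap in the DSM for each candidate sense
    sense_probabilities -- sense distribution value for each candidate sense
    factor              -- weight of the similarity in the combination
    """
    scores = [
        factor * sim + (1 - factor) * prob
        for sim, prob in zip(similarities, sense_probabilities)
    ]
    best = max(range(len(scores)), key=scores.__getitem__)
    return best, scores


if __name__ == "__main__":
    # Invented values: the first sense is slightly closer in the word space,
    # but the second one is far more frequent in the annotated corpus.
    best, scores = choose_sense([0.6, 0.5], [0.1, 0.9])
    print(best, scores)
    # With factor=1.0 only the similarity counts.
    print(choose_sense([0.6, 0.5], [0.1, 0.9], factor=1.0)[0])
```

Setting the factor to 1.0 corresponds to running the system without sense distribution.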
Evaluation Goal
Evaluation
Goals
Compare our system against
1 Simplified Lesk approach
2 Other task participants
Evaluate the system with and without sense distribution
• Sense distribution linearly combined with the cosine similarity score
Dataset
• Dataset: Task-12 of SemEval-2013 Multilingual Word Sense Disambiguation
• Sense inventory: BabelNet
• Metrics: F-measure
Evaluation System setup
System setup
• Developed in Java relying on BabelNet API 1.1.1
• Lucene analyzer to tokenize both glosses and the context; Snowball library for stemming
• Latent Semantic Analysis for building the DSM, considering the 100,000 most frequent words
  • BNC corpus for English
  • Wikipedia dump for Italian
• Synset distance d is set to 1
• Several context sizes: 3, 5, 10, 20 and the whole text
• Combination factor for cosine similarity and sense distribution: 0.5
Evaluation Results
English
Run          Context size   Sense distr.   F
MFS          -              -              0.656
EN.LESK.1    3              N              0.525
EN.LESK.6    3              Y              0.633
EN.DSM.1     3              N              0.536
EN.DSM.2     5              N              0.605
EN.DSM.3     10             N              0.633
EN.DSM.4     20             N              0.650
EN.DSM.5     W              N              0.687
EN.DSM.6     3              Y              0.669
EN.DSM.7     5              Y              0.677
EN.DSM.8     10             Y              0.689
EN.DSM.9     20             Y              0.696
EN.DSM.10    W              Y              0.715
(W = whole text)
Evaluation Results
Italian
Run          Context size   Sense distr.   F
MFS          -              -              0.572
IT.LESK.2    5              N              0.530
IT.LESK.10   W              Y              0.607
IT.DSM.1     3              N              0.610
IT.DSM.2     5              N              0.607
IT.DSM.3     10             N              0.626
IT.DSM.4     20             N              0.628
IT.DSM.5     W              N              0.633
IT.DSM.6     3              Y              0.631
IT.DSM.7     5              Y              0.630
IT.DSM.8     10             Y              0.635
IT.DSM.9     20             Y              0.639
IT.DSM.10    W              Y              0.641
(W = whole text)
Evaluation Task results
English
System        F
EN.DSM.10     0.715
EN.DSM.5      0.687
UMCC-DLSI-2   0.685
UMCC-DLSI-3   0.680
UMCC-DLSI-1   0.677
MFS           0.656
DAEBAK        0.604
GETALP-BN-1   0.263
GETALP-BN-2   0.266
Evaluation Task results
Italian
System        F
UMCC-DLSI-2   0.658
UMCC-DLSI-1   0.657
IT.DSM.10     0.641
IT.DSM.5      0.633
DAEBAK        0.613
MFS           0.572
GETALP-BN-2   0.325
GETALP-BN-1   0.324
Conclusions and Future Work Conclusions
Recap
• The proposed algorithm outperforms the simple Lesk one for both English and Italian
• Using the whole text as context, the system without knowledge about sense distribution outperforms the MFS baseline in both languages
• For English the system obtained the best results in the SemEval-2013 Task 12 with or without sense distribution
Conclusions and Future Work Future work
What’s next?
• Extend the evaluation to other languages
• Evaluate different DSMs and compositional approaches
• Adapt our approach to a specific domain
  • Use a domain corpus for DSM building
  • Exploit a domain sense-annotated corpus for sense distribution
That’s all folks!
The system is available online: https://github.com/pippokill/lesk-wsd-dsm