Natural Language Processing - Hasso Plattner Institute .Natural Language Processing Lexical Semantics

  • View
    214

  • Download
    0

Embed Size (px)

Text of Natural Language Processing - Hasso Plattner Institute .Natural Language Processing Lexical...

  • Natural Language ProcessingLexical Semantics

    Word Sense Disambiguation and Word SimilarityPotsdam, 31 May 2012

    Saeedeh MomtaziInformation Systems Group

    based on the slides of the course book

  • Outline

    1 Lexical SemanticsWordNet

    2 Word Sense Disambiguation

    3 Word Similarity

    Saeedeh Momtazi | NLP | 31.05.2012

    2

  • Outline

    1 Lexical SemanticsWordNet

    2 Word Sense Disambiguation

    3 Word Similarity

    Saeedeh Momtazi | NLP | 31.05.2012

    3

  • Word Meaning

    Considering the meaning(s) of a word in addition to its writtenform

    Word SenseA discrete representation of an aspect of the meaning of a word

    Saeedeh Momtazi | NLP | 31.05.2012

    4

  • Word

    LexemeAn entry in a lexicon consisting of a pair:a form with a single meaning representation

    Camel (animal)Camel (music band)

    LemmaThe grammatical form that is used to represent a lexeme

    Camel

    Saeedeh Momtazi | NLP | 31.05.2012

    5

  • Homonymy

    Words which have similar form but different meanings

    Camel (animal)Camel (music band) Homogra

    phs

    WriteRight Homoph

    one

    Saeedeh Momtazi | NLP | 31.05.2012

    6

  • Semantics Relations

    Realizing lexical relations among words

    Hyponymy (is a) {parent: hypernym, child: hyponym }dog & animal

    Meronymy (part of)arm & body

    Synonymyfall & autumn

    Antonymytall & short

    Relations are between sensesrather than words

    Saeedeh Momtazi | NLP | 31.05.2012

    7

  • Outline

    1 Lexical SemanticsWordNet

    2 Word Sense Disambiguation

    3 Word Similarity

    Saeedeh Momtazi | NLP | 31.05.2012

    8

  • WordNet

    A hierarchical database of lexical relations

    Three Separate sub-databasesNounsVerbsAdjectives and Adverbs

    Closed class words are not included

    Each word is annotated with a set of senses

    Available onlinehttp://wordnetweb.princeton.edu/perl/webwn

    Saeedeh Momtazi | NLP | 31.05.2012

    9

    http://wordnetweb.princeton.edu/perl/webwn

  • WordNet

    Number of words in WordNet 3.0

    Category EntryNoun 117,097Verb 11,488Adjective 22,141Adverb 4,061

    Average number of senses in WordNet 3.0

    Category SenseNoun 1.23Verb 2.16

    Saeedeh Momtazi | NLP | 31.05.2012

    10

  • Word Sense

    Synset (synonym set)

    Saeedeh Momtazi | NLP | 31.05.2012

    11

  • Word Relations (Hypernym)

    Saeedeh Momtazi | NLP | 31.05.2012

    12

  • Word Relations (Sister)

    Saeedeh Momtazi | NLP | 31.05.2012

    13

  • Outline

    1 Lexical SemanticsWordNet

    2 Word Sense Disambiguation

    3 Word Similarity

    Saeedeh Momtazi | NLP | 31.05.2012

    14

  • Applications

    Information retrievalMachine translationSpeech synthesis

    Saeedeh Momtazi | NLP | 31.05.2012

    15

  • Information retrieval

    Saeedeh Momtazi | NLP | 31.05.2012

    16

  • Machine translation

    Saeedeh Momtazi | NLP | 31.05.2012

    17

  • Example

    Sense: band 532736 Music NThe band made copious recordings now regarded as classic from1941 to 1950. These were to have a tremendous influence on theworldwide jazz revival to come During the war Lu led a 20 piecenavy band in Hawaii.

    Saeedeh Momtazi | NLP | 31.05.2012

    18

  • Example

    Sense: band 532838 Rubber-band NHe had assumed that so famous and distinguished a professor

    would have been given the best possible medical attention it wasthe sort of assumption young men make. Here suspended fromLewiss person were pieces of tubing held on by rubber bands anold wooden peg a bit of cork.

    Saeedeh Momtazi | NLP | 31.05.2012

    19

  • Example

    Sense: band 532734 Range N

    There would be equal access to all currencies financialinstruments and financial services dash and no majorconstitutional change. As realignments become more rare andexchange rates waver in narrower bands the system could evolveinto one of fixed exchange rates.

    Saeedeh Momtazi | NLP | 31.05.2012

    20

  • Word Sense Disambiguation

    InputA wordThe context of the wordSet of potential senses for the word

    OutputThe best sense of the word for this context

    Saeedeh Momtazi | NLP | 31.05.2012

    21

  • Approaches

    Thesaurus-based

    Supervised learning

    Semi-supervised learning

    Saeedeh Momtazi | NLP | 31.05.2012

    22

  • Thesaurus-based

    Extracting sense definitions from existing sourcesDictionariesThesauriWikipedia

    Saeedeh Momtazi | NLP | 31.05.2012

    23

  • Thesaurus-based

    Saeedeh Momtazi | NLP | 31.05.2012

    24

  • The Lesk Algorithm

    Selecting the sense whose definition shares the most wordswith the words context

    Simplified Algorithm [Kilgarriff and Rosenzweig, 2000]

    Saeedeh Momtazi | NLP | 31.05.2012

    25

  • The Lesk Algorithm

    Simple to implementNo training data neededRelatively bad results

    Saeedeh Momtazi | NLP | 31.05.2012

    26

  • Supervised Learning

    Training data:A corpus in which each occurrence of the ambiguous word w isannotated by its correct sense

    SemCor : 234,000 sense-tagged from Brown corpusSENSEVAL-1: 34 target wordsSENSEVAL-2: 73 target wordsSENSEVAL-3: 57 target words (2081 sense-tagged)

    Saeedeh Momtazi | NLP | 31.05.2012

    27

  • Feature Selection

    Using the words in the context with a specific window size

    CollocationConsidering all words in a window (as well as their POS) and theirposition

    Bag-of-wordConsidering the frequent words regardless their positionDeriving a set of k most frequent words in the window from thetraining corpusRepresenting each word in the data as a k -dimention vectorFinding the frequency of the selected words in the context of thecurrent observation

    Saeedeh Momtazi | NLP | 31.05.2012

    28

  • Collocation

    Sense: band 532734 Range N

    There would be equal access to all currencies financialinstruments and financial services dash and no majorconstitutional change. As realignments become more rare andexchange rates waver in narrower bands the system could evolveinto one of fixed exchange rates.

    Window size: +/- 3Context: waver in narrower bands the system could{Wn3, Pn3, Wn2, Pn2, Wn1, Pn1, Wn+1, Pn+1, Wn+2, Pn+2, Wn+3, Pn+3}{waver, NN, in , IN , narrower, JJ, the, DT, system, NN , could, MD}

    Saeedeh Momtazi | NLP | 31.05.2012

    29

  • Bag-of-word

    Sense: band 532734 Range N

    There would be equal access to all currencies financialinstruments and financial services dash and no majorconstitutional change. As realignments become more rare andexchange rates waver in narrower bands the system could evolveinto one of fixed exchange rates.

    Window size: +/- 3Context: waver in narrower bands the system couldk frequent words for band:{circle, dance, group, jewelery, music, narrow, ring, rubber, wave}{ 0 , 0 , 0 , 0 , 0 , 1 , 0 , 0 , 1 }

    Saeedeh Momtazi | NLP | 31.05.2012

    30

  • Nave Bayes Classification

    Choosing the best sense s out of all possible senses si for afeature vector ~f of the word w

    s = argmaxsi P(si |~f )

    s = argmaxsiP(~f |si) P(si)

    P(~f )

    P(~f ) has no effect

    s = argmaxsi P(~f |si) P(si)

    Saeedeh Momtazi | NLP | 31.05.2012

    31

  • Nave Bayes Classification

    s = argmaxsi P(si) P(~f |si)

    Likelihood

    ProbabilityPr

    ior

    Probability

    s = argmaxsi P(si) m

    j=1

    P(fj |si)

    P(si) =#(si)#(w)

    #(si): number of times the sense si is used for the word w in the training data#(w): the total number of samples for the word w

    Saeedeh Momtazi | NLP | 31.05.2012

    32

  • Nave Bayes Classification

    s = argmaxsi P(si) P(~f |si)

    Likelihood

    ProbabilityPr

    ior

    Probability

    s = argmaxsi P(si) m

    j=1

    P(fj |si)

    P(fj |si) =#(fj , si)#(si)

    #(fj , si): the number of times the feature fj occurred for the sense si of word w#(si): the total number of samples of w with the sense si in the training data

    Saeedeh Momtazi | NLP | 31.05.2012

    33

  • Semi-supervised Learning

    What is the best approach when we do not have enough datato train a model?

    Saeedeh Momtazi | NLP | 31.05.2012

    34

  • Semi-supervised Learning

    A small amount of labeled data

    A large amount of unlabeled data

    SolutionFinding the similarity between the labeled andunlabeled dataPredicting the labels of the unlabeled data

    Saeedeh Momtazi | NLP | 31.05.2012

    35

  • Semi-supervised Learning

    What is the best approach when we do not have enough datato train a model?

    For each sense,Select the most important word which frequently co-occurs withthe target word only for this particular senseFind the sentences from unlabeled data which contain the targetword and the selected wordLabel the sentence with the corresponding senseAdd the new labeled sentences to the training data

    Example for Band

    sense selected wordMusic playRubber elasticRange spectrum

    Saeedeh Momtazi | NLP | 31.05.2012

    36

  • Outline

    1 Lexical SemanticsWordNet

    2 Word Sense Disambiguation

    3 Word Similarity

    Saeedeh Momtazi | NLP | 31.05.2012

    37

  • Word Similarity

    Task

    Finding the similarity between two wordsCovering somewhat a wider range of relations in the meaning(different with synonymy)Being defined with a score (degree of similarity)

    ExampleBank (financial inst