
Chapter 20 Part 3


Page 1: Chapter 20 Part 3

1

Chapter 20 Part 3

Computational Lexical Semantics

Acknowledgements: these slides include material from Dan Jurafsky, Rada Mihalcea, Ray Mooney, Katrin Erk, and Ani Nenkova

Page 2: Chapter 20 Part 3

2

Similarity Metrics

• Similarity metrics are useful not just for word sense disambiguation, but also for:
  – Finding topics of documents
  – Representing word meanings, not with respect to a fixed sense inventory

• We will start with dictionary based methods and then look at vector space models

Page 3: Chapter 20 Part 3

3

Thesaurus-based word similarity

• We could use anything in the thesaurus
  – Meronymy
  – Glosses
  – Example sentences

• In practice, by "thesaurus-based" we just mean using the is-a/subsumption/hypernym hierarchy

• Can define similarity between words or between senses

Page 4: Chapter 20 Part 3

4

Path-based similarity

• Two senses are similar if nearby in thesaurus hierarchy (i.e. short path between them)

Page 5: Chapter 20 Part 3

5

Path-based similarity

• pathlen(c1,c2) = number of edges in the shortest path between the sense nodes c1 and c2

• wordsim(w1,w2) = max over c1 ∈ senses(w1), c2 ∈ senses(w2) of sim(c1,c2), i.e., take the pair of senses with the shortest path between them
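A minimal sketch of these two definitions using NLTK's WordNet interface (assumes the nltk package and its wordnet data are installed; NLTK's path_similarity is 1/(1 + pathlen) rather than a raw edge count, so taking the max picks the closest pair of senses):

```python
# Sketch of path-based word similarity with NLTK's WordNet
# (assumes the data has been downloaded, e.g. nltk.download('wordnet')).
from itertools import product
from nltk.corpus import wordnet as wn

def pathlen(c1, c2):
    """Number of edges on the shortest path between two synsets (None if unconnected)."""
    return c1.shortest_path_distance(c2)

def wordsim(w1, w2):
    """Max path similarity over all sense pairs of the two words.
    NLTK's path_similarity is 1 / (1 + pathlen), so higher means closer."""
    sims = [c1.path_similarity(c2)
            for c1, c2 in product(wn.synsets(w1), wn.synsets(w2))]
    sims = [s for s in sims if s is not None]
    return max(sims) if sims else 0.0

print(wordsim("nickel", "dime"))    # senses that are nearby in the hierarchy
print(wordsim("nickel", "budget"))  # senses that are farther apart
```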

Page 6: Chapter 20 Part 3

6

Problem with basic path-based similarity

• Assumes each link represents a uniform distance

• But, some areas of WordNet are more developed than others

• This depends on the people who created it

• Also, links deep in the hierarchy are intuitively more narrow than links higher up [on slide 4, e.g., nickel to money vs nickel to standard]

Page 7: Chapter 20 Part 3

7

Information content similarity metrics

• Let's define P(c) as:
  – The probability that a randomly selected word in a corpus is an instance of concept c
  – A word is an instance of a concept if it appears below the concept in the WordNet hierarchy
  – We saw this idea when we covered selectional preferences

Page 8: Chapter 20 Part 3

8

In particular

– If there is a single node that is the ancestor of all nodes, then its probability is 1

– The lower a node in the hierarchy, the lower its probability

– An occurrence of the word dime would count towards the frequency of coin, currency, standard, etc.

Page 9: Chapter 20 Part 3

9

Information content similarity

• Train by counting in a corpus
  – One instance of "dime" counts toward the frequency of coin, currency, standard, etc.

• More formally:

  P(c) = ( Σ_{w ∈ words(c)} count(w) ) / N

  where words(c) is the set of words subsumed by concept c, and N is the total number of words (tokens) in the corpus that are also in the thesaurus
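A rough sketch of this counting procedure (assumptions: nouns only, and every token of a word is credited to all synsets of that word plus all of their ancestors, which is one common simplification; concept_probabilities is a hypothetical helper name):

```python
# Sketch: estimate P(c) for WordNet noun concepts from a list of corpus tokens.
# Each token counts toward every synset containing the word and all hypernyms
# of those synsets.
import math
from collections import Counter
from nltk.corpus import wordnet as wn

def concept_probabilities(tokens):
    counts = Counter()
    n = 0
    for word in tokens:
        synsets = wn.synsets(word, pos=wn.NOUN)
        if not synsets:
            continue          # only words that are in the thesaurus count toward N
        n += 1
        seen = set()
        for s in synsets:
            for path in s.hypernym_paths():
                seen.update(path)   # the synset itself plus all of its ancestors
        for c in seen:
            counts[c] += 1
    return {c: counts[c] / n for c in counts}, n

probs, n = concept_probabilities(["dime", "nickel", "budget", "dime"])
coin = wn.synset('coin.n.01')
print(probs.get(coin), "IC =", -math.log(probs[coin]) if coin in probs else None)
```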

Page 10: Chapter 20 Part 3

10

Information content similarity

(Figure) The WordNet hierarchy augmented with probabilities P(c)

Page 11: Chapter 20 Part 3

11

Information content: definitions

• Information content: IC(c) = -log P(c)

• Lowest common subsumer LCS(c1,c2)
  – The lowest node in the hierarchy that subsumes (is a hypernym of) both c1 and c2
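For reference, NLTK exposes both quantities directly; a small sketch, assuming the wordnet and wordnet_ic corpora have been downloaded:

```python
# Sketch: information content and lowest common subsumer via NLTK
# (assumes nltk.download('wordnet') and nltk.download('wordnet_ic')).
from nltk.corpus import wordnet as wn, wordnet_ic
from nltk.corpus.reader.wordnet import information_content

brown_ic = wordnet_ic.ic('ic-brown.dat')       # pre-computed P(c) from the Brown corpus
nickel = wn.synset('nickel.n.02')              # the coin sense
dime = wn.synset('dime.n.01')

lcs = nickel.lowest_common_hypernyms(dime)[0]  # e.g. coin.n.01
print(lcs, information_content(lcs, brown_ic)) # IC(c) = -log P(c)
```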

Page 12: Chapter 20 Part 3

12

Resnik method

• The similarity between two senses is related to their common information

• The more two senses have in common, the more similar they are

• Resnik: measure the common information as:

– The info content of the lowest common subsumer of the two senses

– sim_Resnik(c1,c2) = -log P(LCS(c1,c2))
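NLTK implements this measure as res_similarity over a pre-computed information-content table; a minimal sketch, with the same assumptions as above:

```python
# Sketch: Resnik similarity = IC of the lowest common subsumer.
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
nickel = wn.synset('nickel.n.02')
dime = wn.synset('dime.n.01')

print(nickel.res_similarity(dime, brown_ic))   # -log P(LCS(nickel, dime))
```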

Page 13: Chapter 20 Part 3

Example Use:

• Yaw Gyamfi, Janyce Wiebe, Rada Mihalcea, and Cem Akkaya (2009). Integrating Knowledge for Subjectivity Sense Labeling. HLT-NAACL 2009.

13

Page 14: Chapter 20 Part 3

What is Subjectivity?

• The linguistic expression of somebody’s opinions, sentiments, emotions, evaluations, beliefs, speculations (private states)

This particular use of subjectivity was adapted from literary theory (Banfield 1982; Wiebe 1990)

Page 15: Chapter 20 Part 3

Examples of Subjective Expressions

• References to private states
  – She was enthusiastic about the plan

• Descriptions
  – That would lead to disastrous consequences
  – What a freak show

Page 16: Chapter 20 Part 3

Subjectivity Analysis

• Automatic extraction of subjectivity (opinions) from text or dialog

Page 17: Chapter 20 Part 3

Subjectivity Analysis: Applications

• Opinion-oriented question answering: How do the Chinese regard the human rights record of the United States?

• Product review mining: What features of the ThinkPad T43 do customers like and which do they dislike?

• Review classification: Is a review positive or negative toward the movie?

• Tracking sentiments toward topics over time: Is anger ratcheting up or cooling down?

• Etc.

Page 18: Chapter 20 Part 3

Subjectivity Lexicons

• Most approaches to subjectivity and sentiment analysis exploit subjectivity lexicons
  – Lists of keywords that have been gathered together because they have subjective uses
  – Examples: brilliant, difference, hate, interest, love

Page 19: Chapter 20 Part 3

Automatically Identifying Subjective Words

• Much work in this area:
  – Hatzivassiloglou & McKeown (ACL 1997)
  – Wiebe (AAAI 2000)
  – Turney (ACL 2002)
  – Kamps & Marx (2002)
  – Wiebe, Riloff, & Wilson (CoNLL 2003)
  – Yu & Hatzivassiloglou (EMNLP 2003)
  – Kim & Hovy (IJCNLP 2005)
  – Esuli & Sebastiani (CIKM 2005)
  – Andreevskaia & Bergler (EACL 2006)
  – Etc.

• Subjectivity lexicon available at http://www.cs.pitt.edu/mpqa (entries from several sources)

Page 20: Chapter 20 Part 3

However…

• Consider the keyword “interest”

• It is in the subjectivity lexicon

• But, what about “interest rate,” for example?

Page 21: Chapter 20 Part 3

WordNet Senses

Interest, involvement -- (a sense of concern with and curiosity about someone or something; "an interest in music")

Interest -- (a fixed charge for borrowing money; usually a percentage of the amount borrowed; "how much interest do you pay on your mortgage?")

Page 22: Chapter 20 Part 3

WordNet Senses

Interest, involvement -- (a sense of concern with and curiosity about someone or something; "an interest in music")  [S]

Interest -- (a fixed charge for borrowing money; usually a percentage of the amount borrowed; "how much interest do you pay on your mortgage?")  [O]

Page 23: Chapter 20 Part 3

Senses

• Even in subjectivity lexicons, many senses of the keywords are objective

• Thus, many appearances of keywords in texts are false hits

Page 24: Chapter 20 Part 3

WordNet (Miller 1995; Fellbaum 1998)

Page 25: Chapter 20 Part 3

Examples

• “There are many differences between African and Asian elephants.”

• “… dividing by the absolute value of the difference from the mean…”

• “Their differences only grew as they spent more time together …”

• "Her support really made a difference in my life"

• "The difference after subtracting X from Y…"

Page 26: Chapter 20 Part 3

Our Task: Subjectivity Sense Labeling

• Automatically classifying senses as subjective or objective

• Purpose: exploit the labels to improve
  – Word sense disambiguation (Wiebe and Mihalcea, ACL 2006)
  – Automatic subjectivity and sentiment analysis systems (Akkaya, Wiebe, and Mihalcea 2009, 2010, 2011, 2012, 2014)

Page 27: Chapter 20 Part 3

Subjectivity Tagging using Subjectivity WSD

(Diagram) The senses of "difference" are labeled sense#1 O, sense#2 O, sense#3 S, sense#4 S, sense#5 O, so the objective sense set is {1, 2, 5} and the subjective sense set is {3, 4}. An SWSD (subjectivity word sense disambiguation) system feeds a subjectivity or sentiment classifier; for each instance below, is the correct label S or O?

"There are many differences between African and Asian elephants."

"Their differences only grew as they spent more time together …"

Page 28: Chapter 20 Part 3

Subjectivity Tagging using Subjectivity WSD

(Diagram, continued) The SWSD system resolves the labels: the first instance ("There are many differences between African and Asian elephants.") receives an objective sense and is tagged O, while the second ("Their differences only grew as they spent more time together …") receives a subjective sense and is tagged S. These tags are passed on to the subjectivity or sentiment classifier.

Page 29: Chapter 20 Part 3

Using Hierarchical Structure

(Diagram) A seed sense and a target sense in the WordNet hierarchy, connected through their LCS (lowest common subsumer).

Page 30: Chapter 20 Part 3

Using Hierarchical Structure

(Diagram) Example with the target sense voice#1 (objective) and a seed sense, connected through their LCS.

Page 31: Chapter 20 Part 3

• If you are interested in the entire approach and experiments, please see the paper (it is on my website)

31

Page 32: Chapter 20 Part 3

Dekang Lin method

• Intuition: similarity between A and B is not just what they have in common
  – Commonality: the more A and B have in common, the more similar they are
  – Difference: the more differences between A and B, the less similar they are

• Commonality: IC(common(A,B))

• Difference: IC(description(A,B)) - IC(common(A,B))

Dekang Lin. 1998. An Information-Theoretic Definition of Similarity. ICML

Page 33: Chapter 20 Part 3

Dekang Lin similarity theorem

• The similarity between A and B is measured by the ratio between the amount of information needed to state the commonality of A and B and the information needed to fully describe what A and B are

• Lin (altering Resnik) defines:

Page 34: Chapter 20 Part 3

Lin similarity function
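The standard Lin measure is sim_Lin(c1,c2) = 2 · log P(LCS(c1,c2)) / (log P(c1) + log P(c2)): twice the information content of the LCS divided by the sum of the information contents of the two senses, a value between 0 and 1. NLTK implements it as lin_similarity; a minimal sketch, assuming the wordnet and wordnet_ic data are installed:

```python
# Sketch: Lin similarity in NLTK.
from nltk.corpus import wordnet as wn, wordnet_ic

brown_ic = wordnet_ic.ic('ic-brown.dat')
nickel = wn.synset('nickel.n.02')
dime = wn.synset('dime.n.01')

# sim_Lin(c1, c2) = 2 * IC(LCS(c1, c2)) / (IC(c1) + IC(c2)), in [0, 1]
print(nickel.lin_similarity(dime, brown_ic))
```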

Page 35: Chapter 20 Part 3

35

Summary: thesaurus-based similarity between senses

• There are many metrics (you don’t have to memorize these)

Page 36: Chapter 20 Part 3

Using Thesaurus-Based Similarity for WSD

• One specific method (Banerjee & Pedersen 2003):

• For each sense k of target word t:
  – SenseScore[k] = 0
  – For each word w appearing within -N and +N of t:
    • For each sense s of w:
      – SenseScore[k] += similarity(k, s)

• The sense with the highest SenseScore is assigned to the target word

36
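A minimal sketch of this scoring loop, using WordNet path similarity as the sense-to-sense measure (Banerjee & Pedersen's own measure is extended gloss overlap; only the loop structure is taken from the slide):

```python
# Sketch: score each sense of a target word by summing its similarity to all
# senses of the words in a +/-N window; path similarity stands in for the
# gloss-overlap measure of Banerjee & Pedersen (2003).
from nltk.corpus import wordnet as wn

def disambiguate(words, target_index, n=3):
    window = words[max(0, target_index - n):target_index] + \
             words[target_index + 1:target_index + 1 + n]
    best_sense, best_score = None, float('-inf')
    for k in wn.synsets(words[target_index]):
        score = 0.0
        for w in window:
            for s in wn.synsets(w):
                sim = k.path_similarity(s)
                score += sim if sim is not None else 0.0
        if score > best_score:
            best_sense, best_score = k, score
    return best_sense

sent = "I deposited the check at the bank yesterday".split()
print(disambiguate(sent, sent.index("bank")))
```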

Page 37: Chapter 20 Part 3

Problems with thesaurus-based meaning

• We don't have a thesaurus for every language

• Even if we do, thesauri have problems with recall

– Many words are missing

– Most (if not all) phrases are missing

– Some connections between senses are missing

– Thesauri work less well for verbs and adjectives
  • Adjectives and verbs have less structured hyponymy relations

Page 38: Chapter 20 Part 3

Distributional models of meaning

• Also called vector-space models of meaning

• Offer much higher recall than hand-built thesauri

– Although they tend to have lower precision

• Zellig Harris (1954): "oculist and eye-doctor … occur in almost the same environments…. If A and B have almost identical environments we say that they are synonyms."

• Firth (1957): "You shall know a word by the company it keeps!"

38

Page 39: Chapter 20 Part 3

Intuition of distributional word similarity

• Nida example:
  – A bottle of tesgüino is on the table
  – Everybody likes tesgüino
  – Tesgüino makes you drunk
  – We make tesgüino out of corn

• From the context words, humans can guess that tesgüino means an alcoholic beverage like beer

• Intuition for the algorithm:
  – Two words are similar if they have similar word contexts

Page 40: Chapter 20 Part 3

Reminder: Term-document matrix

• Each cell: the count of term t in document d, tf_t,d
  – Each document is a count vector: a column in the matrix

40

Page 41: Chapter 20 Part 3

Reminder: Term-document matrix

• Two documents are similar if their vectors are similar

41

Page 42: Chapter 20 Part 3

The words in a term-document matrix

• Each word is a count vector: a row below

42

Page 43: Chapter 20 Part 3

The words in a term-document matrix

• Two words are similar if their vectors are similar

43

Page 44: Chapter 20 Part 3

The Term-Context matrix

• Instead of using entire documents, use smaller contexts
  – A paragraph
  – A window of 10 words

• A word is now defined by a vector over counts of context words

44
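A small sketch of building such a term-context matrix from a ±2-word window and comparing two word vectors with cosine similarity (toy corpus, raw counts only; weighting comes later):

```python
# Sketch: term-context count vectors from a +/-2 word window, compared by cosine.
import math
from collections import defaultdict

corpus = [
    "a bottle of tesguino is on the table".split(),
    "everybody likes tesguino".split(),
    "tesguino makes you drunk".split(),
    "we make tesguino out of corn".split(),
    "everybody likes beer".split(),
]

vectors = defaultdict(lambda: defaultdict(int))
for sent in corpus:
    for i, w in enumerate(sent):
        for j in range(max(0, i - 2), min(len(sent), i + 3)):
            if j != i:
                vectors[w][sent[j]] += 1   # count co-occurring context words

def cosine(u, v):
    dot = sum(u[k] * v.get(k, 0) for k in u)
    norm = math.sqrt(sum(x * x for x in u.values())) * math.sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

print(cosine(vectors["tesguino"], vectors["beer"]))
print(cosine(vectors["tesguino"], vectors["table"]))
```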

Page 45: Chapter 20 Part 3

Sample contexts: 20 words (Brown corpus)

• equal amount of sugar, a sliced lemon, a tablespoonful of apricot preserve or jam, a pinch each of clove and nutmeg,

• on board for their enjoyment. Cautiously she sampled her first pineapple and another fruit whose taste she likened to that of

• of a recursive type well suited to programming on the digital computer. In finding the optimal R-stage policy from that of

• substantially affect commerce, for the purpose of gathering data and information necessary for the study authorized in the first section of this

45

Page 46: Chapter 20 Part 3

Term-context matrix for word similarity

• Two words are similar in meaning if their context vectors are similar

46

Page 47: Chapter 20 Part 3

Should we use raw counts?

• For the term-document matrix
  – We used tf-idf instead of raw term counts

• For the term-context matrix
  – Positive Pointwise Mutual Information (PPMI) is common

47

Page 48: Chapter 20 Part 3

Pointwise Mutual Information

• Pointwise mutual information
  – Do events x and y co-occur more often than if they were independent?
  – PMI(x,y) = log2 [ P(x,y) / (P(x) P(y)) ]

• PMI between two words (Church & Hanks 1989)
  – Do words x and y co-occur more often than if they were independent?

• Positive PMI between two words (Niwa & Nitta 1994)
  – PPMI(x,y) = max(PMI(x,y), 0), i.e., replace all PMI values less than zero with zero

Page 49: Chapter 20 Part 3

Computing PPMI on a term-context matrix

• Matrix F with W rows (words) and C columns (contexts)

• f_ij is the number of times word w_i occurs in context c_j

• p_ij = f_ij / Σ_i Σ_j f_ij,  p_i* = Σ_j p_ij,  p_*j = Σ_i p_ij

• pmi_ij = log2 ( p_ij / (p_i* p_*j) ),  ppmi_ij = max(pmi_ij, 0)

49
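A sketch of these formulas applied to a raw count matrix (numpy assumed; rows are words, columns are contexts):

```python
# Sketch: PPMI from a word-by-context count matrix F (rows = words, cols = contexts).
import numpy as np

def ppmi(F):
    total = F.sum()
    p_ij = F / total                       # joint probabilities
    p_i = p_ij.sum(axis=1, keepdims=True)  # word (row) marginals
    p_j = p_ij.sum(axis=0, keepdims=True)  # context (column) marginals
    with np.errstate(divide='ignore', invalid='ignore'):
        pmi = np.log2(p_ij / (p_i * p_j))
    pmi[~np.isfinite(pmi)] = 0.0           # zero counts give -inf/NaN; zero them out
    return np.maximum(pmi, 0.0)            # keep only positive PMI

F = np.array([[0., 1., 2., 1.],
              [3., 1., 0., 1.],
              [1., 6., 1., 3.]])
print(ppmi(F).round(2))
```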

Page 50: Chapter 20 Part 3

p(w=information, c=data) = 6/19 = .32

p(w=information) = 11/19 = .58

p(c=data) = 7/19 = .37

50

Page 51: Chapter 20 Part 3

51

• pmi(information, data) = log2( .32 / (.37 × .58) ) = .58
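A one-line check of that arithmetic:

```python
import math
print(round(math.log2(0.32 / (0.37 * 0.58)), 2))   # 0.58
```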

Page 52: Chapter 20 Part 3

Weighing PMI

• PMI is biased toward infrequent events

• Various weighting schemes help alleviate this
  – See Turney and Pantel (2010)
  – Add-one smoothing can also help

52

Page 53: Chapter 20 Part 3

Summary: vector space models

• Representing meaning through counts
  – Represent a document/sentence/context through its content words

• Proximity in semantic space ~ similarity between words

53

Page 54: Chapter 20 Part 3

Summary: vector space models

• Uses:

– Search

– Inducing ontologies

– Modeling human judgments of word similarity

– Improve supervised word sense disambiguation

– Word-sense discrimination: cluster words based on vectors; the clusters may not correspond to any particular sense inventory

54

Page 55: Chapter 20 Part 3

55

SenseEval

• Standardized international “competition” on WSD.

• Organized by the Association for Computational Linguistics (ACL) Special Interest Group on the Lexicon (SIGLEX)

– Senseval 1: 1998

– Senseval 2: 2001

– Senseval 3: 2004

– Senseval 4: 2007

Page 56: Chapter 20 Part 3

56

Senseval 1: 1998

• Datasets for
  – English
  – French
  – Italian

• Lexical sample in English
  – Nouns: accident, behavior, bet, disability, excess, float, giant, knee, onion, promise, rabbit, sack, scrap, shirt, steering
  – Verbs: amaze, bet, bother, bury, calculate, consume, derive, float, invade, promise, sack, scrap, seize
  – Adjectives: brilliant, deaf, floating, generous, giant, modest, slight, wooden
  – Indeterminate: band, bitter, hurdle, sanction, shake

• Total number of ambiguous English words tagged: 8,448

Page 57: Chapter 20 Part 3

57

Senseval 1 English Sense Inventory

• Senses from the HECTOR lexicography project.

• Multiple levels of granularity
  – Coarse-grained (avg. 7.2 senses per word)
  – Fine-grained (avg. 10.4 senses per word)

Page 58: Chapter 20 Part 3

58

Senseval Metrics

• Fixed training and test sets, the same for each system

• A system can decline to provide a sense tag for a word if it is sufficiently uncertain

• Measured quantities:

– A: number of words assigned senses

– C: number of words assigned correct senses

– T: total number of test words

• Metrics:
  – Precision = C/A
  – Recall = C/T
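The same metrics in code, using the quantities A, C, and T defined above:

```python
# Sketch: Senseval-style scoring where a system may abstain on some words.
def score(assigned, correct, total):
    precision = correct / assigned   # C / A: accuracy on the words it attempted
    recall = correct / total         # C / T: accuracy over all test words
    return precision, recall

print(score(assigned=80, correct=60, total=100))   # (0.75, 0.6)
```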

Page 59: Chapter 20 Part 3

59

Senseval 1 Overall English Results

                                 Fine-grained          Coarse-grained
                                 precision (recall)    precision (recall)

Human lexicographer agreement    97% (96%)             97% (97%)

Most common sense baseline       57% (50%)             63% (56%)

Best system                      77% (77%)             81% (81%)

Page 60: Chapter 20 Part 3

60

Senseval 2: 2001

• More languages: Chinese, Danish, Dutch, Czech, Basque, Estonian, Italian, Korean, Spanish, Swedish, Japanese, English

• Includes an “all-words” task as well as lexical sample.

• Includes a “translation” task for Japanese, where senses correspond to distinct translations of a word into another language.

• 35 teams competed with over 90 systems entered.

Page 61: Chapter 20 Part 3

61

Senseval 2 Results

Page 62: Chapter 20 Part 3

62

Senseval 2 Results

Page 63: Chapter 20 Part 3

63

Senseval 2 Results

Page 64: Chapter 20 Part 3

64

Ensemble Models

• Systems that combine results from multiple approaches seem to work very well.

(Diagram) Training data is fed to System 1, System 2, System 3, …, System n; each produces a result (Result 1 … Result n); the results are combined by weighted voting to produce the final result.
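A minimal sketch of the combination step (the per-system weights are illustrative, e.g. held-out accuracies):

```python
# Sketch: combine sense predictions from several WSD systems by weighted voting.
from collections import defaultdict

def weighted_vote(predictions, weights):
    """predictions: {system_name: sense}; weights: {system_name: weight}."""
    scores = defaultdict(float)
    for system, sense in predictions.items():
        scores[sense] += weights.get(system, 1.0)
    return max(scores, key=scores.get)

preds = {"sys1": "bank.n.01", "sys2": "bank.n.01", "sys3": "bank.n.09"}
weights = {"sys1": 0.7, "sys2": 0.6, "sys3": 0.9}   # e.g. held-out accuracies
print(weighted_vote(preds, weights))                # bank.n.01 (0.7 + 0.6 > 0.9)
```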

Page 65: Chapter 20 Part 3

65

Senseval 3: 2004

• Some new languages: English, Italian, Basque, Catalan, Chinese, Romanian

• Some new tasks
  – Subcategorization acquisition
  – Semantic role labelling
  – Logical form

Page 66: Chapter 20 Part 3

66

Senseval 3 English Lexical Sample

• Volunteers over the web were used to annotate senses of 60 ambiguous nouns, adjectives, and verbs.

• Non-expert lexicographers achieved only 62.8% inter-annotator agreement for fine-grained senses.

• Best results again in the low 70% accuracy range.

Page 67: Chapter 20 Part 3

67

Senseval 3: English All Words Task

• 5,000 words from Wall Street Journal newspaper and Brown corpus (editorial, news, and fiction)

• 2,212 words tagged with WordNet senses.

• Interannotator agreement of 72.5% for people with advanced linguistics degrees.
  – Most disagreements were on a smaller group of difficult words; only 38% of word types had any disagreement at all.

• Most-common-sense baseline: 60.9% accuracy

• Best results from the competition: 65% accuracy

Page 68: Chapter 20 Part 3

68

Other Approaches to WSD

• Active learning

• Unsupervised sense clustering

• Semi-supervised learning (Yarowsky 1995); see the sketch after this list
  – Bootstrap from a small number of labeled examples to exploit unlabeled data
  – Exploit "one sense per collocation" and "one sense per discourse" to create the labeled training data
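A compressed sketch of the bootstrapping idea (scikit-learn is assumed for the classifier; the seed examples, features, and confidence threshold are illustrative, and the full Yarowsky algorithm additionally applies the one-sense-per-collocation and one-sense-per-discourse constraints when growing the labeled set):

```python
# Sketch: Yarowsky-style self-training for WSD from a small labeled seed set.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import LogisticRegression

def bootstrap(seed_texts, seed_labels, unlabeled_texts, threshold=0.9, rounds=5):
    texts, labels = list(seed_texts), list(seed_labels)
    pool = list(unlabeled_texts)
    for _ in range(rounds):
        vec = CountVectorizer()
        clf = LogisticRegression(max_iter=1000)
        clf.fit(vec.fit_transform(texts), labels)
        if not pool:
            break
        probs = clf.predict_proba(vec.transform(pool))
        keep = []
        for text, p in zip(pool, probs):
            if p.max() >= threshold:                 # confident: add to labeled set
                texts.append(text)
                labels.append(clf.classes_[p.argmax()])
            else:
                keep.append(text)
        pool = keep
    return clf, vec

# Usage: contexts of "bass" seeded as fish vs. music, plus unlabeled contexts.
clf, vec = bootstrap(
    ["caught a huge bass in the lake", "played the bass guitar on stage"],
    ["fish", "music"],
    ["the bass jumped out of the river", "tuned the bass before the concert"],
)
```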

Page 69: Chapter 20 Part 3

69

Issues in WSD

• What is the right granularity of a sense inventory?

• Integrating WSD with other NLP tasks
  – Syntactic parsing
  – Semantic role labeling
  – Semantic parsing

• Does WSD actually improve performance on some real end-user task?
  – Information retrieval
  – Information extraction
  – Machine translation
  – Question answering
  – Sentiment analysis