Statistics-based Approaches to Lexical Semantics Martin Thorsen Ranang Department of Computer and Information Science (IDI) Trial Lecture, February 5th 2010 www.ntnu.no Martin Thorsen Ranang, Statistics-based Approaches to Lexical Semantics


Statistics-based Approaches to Lexical Semantics

Martin Thorsen Ranang

Department of Computer and Information Science (IDI)

Trial Lecture, February 5th 2010

www.ntnu.no Martin Thorsen Ranang, Statistics-based Approaches to Lexical Semantics


Outline

Introduction
  • What is Lexical Semantics?
  • Natural Language Processing (NLP) Applications
  • My PhD Research

Statistics-based Approaches to Lexical Semantics
  • Word Sense Disambiguation (WSD)
  • Vector Space Model (VSM)
  • Dimensionality Reduction
  • Ontology Merging and Alignment

Summary


Lexical Semantics

— “The study of how and what the words of a language denote.” (Pustejovsky, 1998)
— Lexical semantic relations like: synonymy, antonymy (“close vs. distant”), hypo-/hypernymy (“car vs. vehicle”)
— Polysemy (lexical ambiguity)
— Selectional restrictions: “Joe ate <...> in a hurry.”
— Typical resources:
  • Dictionaries, Machine Readable Dictionaries (MRDs) (Wilks et al., 1996)
  • Ontologies and Semantic Networks


The Distributional Hypothesis

— “You shall know a word by the company it keeps.” (Firth, 1957)
— “There is a positive relationship between the degree of synonymy (semantic similarity) existing between a pair of words and the degree to which their contexts are similar.” (Rubenstein and Goodenough, 1965)
— “The meaning of entities, and the meaning of grammatical relations among them, is related to the restriction of combinations of these entities relative to other entities.” (Harris, 1968)
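
As a rough illustration of the distributional hypothesis, the sketch below counts the words that occur near each word in a toy corpus and compares two words by the cosine of their context-count vectors. The corpus, the window size, and the helper names `context_vectors` and `cosine` are assumptions made for this example and are not taken from the lecture.

```python
from collections import Counter, defaultdict
from math import sqrt

def context_vectors(sentences, window=2):
    """Count, for every word, the words seen within `window` positions of it."""
    vectors = defaultdict(Counter)
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, word in enumerate(tokens):
            lo, hi = max(0, i - window), min(len(tokens), i + window + 1)
            for j in range(lo, hi):
                if j != i:
                    vectors[word][tokens[j]] += 1
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse count vectors (Counters)."""
    dot = sum(u[k] * v[k] for k in u if k in v)
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Toy corpus (hypothetical): "car" and "automobile" occur in the same contexts.
corpus = [
    "she drove the car to work",
    "she drove the automobile to work",
    "he ate the apple at lunch",
]
vectors = context_vectors(corpus)
sim_synonyms = cosine(vectors["car"], vectors["automobile"])
sim_unrelated = cosine(vectors["car"], vectors["apple"])
print(sim_synonyms, sim_unrelated)
```

On this toy data the two near-synonyms share all their context words, so their similarity is higher than that of the unrelated pair, which shares only the word "the".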


Example Areas

— Word Sense Disambiguation (WSD)
— Natural Language Understanding (NLU) and Text Interpretation (TI)
— Machine Translation (MT)
— Information Retrieval (IR)

What parts of Natural Language Processing (NLP) are not affected by Lexical Semantics?


My PhD Research

— Developed a method for automatically mapping words from languages other than English to concepts in the Princeton WordNet (Miller et al., 1990; Fellbaum, 1998)


WordNet Example


Why Statistics-based?

— Frequencies of actual language usage
— Adapts to changes of the above
— Well suited to provide generalizations and to summarize features of huge text corpora.

(Manning and Schütze, 1999)


Word Sense Disambiguation (WSD)

Bass
— Morone saxatilis
— Tones of low frequency
— Marchione bass guitar


Usage Context

— “He fished for bass using scented attractants.”
— “Joe played the bass fluently, while George played the piano.”
— “When the neighbors play their music I can’t hear the tune but can hear the bass tones.”


Word Sense Disambiguation (WSD)

— Two main approaches:
  • Integrated approach: postponed until semantic analysis; elimination of ill-formed semantic representations
  • Stand-alone approach: independent of, and prior to, compositional semantic analysis; more often statistics-based


Statistics-based Stand-alone Approaches I

Supervised learning

Training: sense-tagged corpus; naïve Bayesian classifiers; feature vectors; “sliding window”. Feature vectors represent local context, and may include words and POS.

Application: Use the trained classifier on unseen ambiguous words, given a local-context feature vector
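
The training and application steps can be sketched as a small naive Bayes classifier over sliding-window features. This is a minimal sketch under stated assumptions: the four sense-tagged sentences are invented, the features are bag-of-words only (no POS tags), add-one smoothing is a choice made here, and the names `window_features`, `NaiveBayesWSD`, and `tagged` are hypothetical.

```python
from collections import Counter, defaultdict
from math import log

def window_features(tokens, index, size=3):
    """Words within `size` positions of the ambiguous word (the sliding window)."""
    lo, hi = max(0, index - size), min(len(tokens), index + size + 1)
    return [tokens[j] for j in range(lo, hi) if j != index]

class NaiveBayesWSD:
    def __init__(self):
        self.sense_counts = Counter()            # how often each sense was seen
        self.word_counts = defaultdict(Counter)  # context-word counts per sense
        self.vocabulary = set()

    def train(self, examples):
        """`examples` is a list of (context_words, sense) pairs."""
        for features, sense in examples:
            self.sense_counts[sense] += 1
            for word in features:
                self.word_counts[sense][word] += 1
                self.vocabulary.add(word)

    def classify(self, features):
        """Return the sense with the highest add-one smoothed log posterior."""
        total = sum(self.sense_counts.values())
        best_sense, best_score = None, float("-inf")
        for sense, count in self.sense_counts.items():
            score = log(count / total)
            denom = sum(self.word_counts[sense].values()) + len(self.vocabulary)
            for word in features:
                score += log((self.word_counts[sense][word] + 1) / denom)
            if score > best_score:
                best_sense, best_score = sense, score
        return best_sense

def tagged(sentence, sense, target="bass"):
    """Turn a sentence and its sense label into a training example."""
    tokens = sentence.lower().split()
    return window_features(tokens, tokens.index(target)), sense

# Hypothetical sense-tagged corpus for the ambiguous word "bass".
training = [
    tagged("he fished for bass in the lake", "fish"),
    tagged("they caught a large bass near the river", "fish"),
    tagged("joe played the bass in a band", "instrument"),
    tagged("she tuned her bass guitar before the concert", "instrument"),
]
classifier = NaiveBayesWSD()
classifier.train(training)

test_tokens = "george played bass with the band".split()
predicted = classifier.classify(window_features(test_tokens, test_tokens.index("bass")))
print(predicted)
```

A realistic system would need far more training data and richer features; the point here is only the shape of the pipeline, from windowed feature extraction to classification.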


Statistics-based Stand-alone Approaches II

Bootstrapping: small number of training instances used as seeds; classifier trained through supervised learning

Unsupervised disambiguation: sense-discrimination, not sense tagging; groups of similar words, based on their local context

Dictionary-based approach: Count overlap between sliding window and dictionary definition of candidate senses.
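
The dictionary-based overlap count can be sketched as follows, in the spirit of the Lesk algorithm. The mini-dictionary, its glosses, the stopword list, and the function name `lesk_overlap` are all invented for this illustration; a real system would take its definitions from a machine-readable dictionary.

```python
def lesk_overlap(context_words, definitions, stopwords=frozenset()):
    """Pick the sense whose definition shares the most words with the context."""
    context = {w for w in context_words if w not in stopwords}
    scores = {}
    for sense, definition in definitions.items():
        gloss = {w for w in definition.lower().split() if w not in stopwords}
        scores[sense] = len(context & gloss)
    best = max(scores, key=scores.get)
    return best, scores

# Hypothetical mini-dictionary for "bass"; the glosses are made up for this example.
definitions = {
    "fish": "a freshwater or sea fish caught by anglers",
    "instrument": "a guitar or other instrument that plays low notes in music",
    "tone": "the low frequency tones of sound or music",
}
stopwords = frozenset({"a", "the", "of", "or", "in", "by", "that", "for", "he", "his"})

# Sliding window of words around the ambiguous occurrence of "bass".
window = "he tuned his bass guitar to play low notes".split()
best_sense, overlap = lesk_overlap(window, definitions, stopwords)
print(best_sense, overlap)
```

Exact string matching is used, so "play" and "plays" do not count as an overlap; stemming or lemmatization would be a natural refinement.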


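Vector Space Model (Salton, 1971)

Term Frequency (importance of term i to doc j):

\[ \mathrm{tf}_{i,j} = \frac{n_{i,j}}{\sum_k n_{k,j}} \]

Inverse Document Frequency (common words are less descriptive):

\[ \mathrm{idf}_i = \log \frac{|D|}{|\{d : t_i \in d\}|} \]

Vector elements:

\[ w_{i,j} = \mathrm{tf}_{i,j} \cdot \mathrm{idf}_i \]

Weight vector for doc d:

\[ v_d = [w_{1,d}, w_{2,d}, \ldots, w_{N,d}]^T \]

Term-document matrix, with one column per document:

\[
\begin{bmatrix} v_1 & v_2 & \cdots & v_d \end{bmatrix}
=
\begin{bmatrix}
w_{1,1} & w_{1,2} & \cdots & w_{1,d} \\
w_{2,1} & w_{2,2} & \cdots & w_{2,d} \\
\vdots & \vdots & \ddots & \vdots \\
w_{N,1} & w_{N,2} & \cdots & w_{N,d}
\end{bmatrix}
\]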
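
The tf, idf, and weight definitions above translate fairly directly into code. The sketch below assumes whitespace tokenization and the natural logarithm (the slide does not fix the base); the three documents and the function name `tf_idf` are hypothetical.

```python
from collections import Counter
from math import log

def tf_idf(documents):
    """Return one {term: weight} dict per document, with weight = tf * idf."""
    counts = [Counter(doc.lower().split()) for doc in documents]
    n_docs = len(counts)
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for c in counts for term in c)
    weights = []
    for c in counts:
        total = sum(c.values())  # sum over k of n_{k,j}
        weights.append({
            term: (n / total) * log(n_docs / df[term])
            for term, n in c.items()
        })
    return weights

docs = [
    "the rocket carried the astronaut",
    "the cosmonaut boarded the rocket",
    "the dictionary defines the word",
]
w = tf_idf(docs)
print(w[0])
```

Because "the" occurs in every document its idf is log(1) = 0, so it gets zero weight, while "astronaut", which occurs in a single document, outweighs "rocket", which occurs in two.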


Vector Space Model

[Figure: vector space with the terms Astronaut, Rocket, and Cosmonaut]

— Enables comparison with other documents, based on content.
— Does it really describe a document’s meaning?
— Restrictions?
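
One common way to compare document vectors is the cosine of the angle between them, and a small example also hints at one restriction of the model. The sketch below is an assumption-laden illustration: the three documents and their weights are invented, and cosine similarity is used here as a typical choice rather than as the measure prescribed by the lecture.

```python
from math import sqrt

def cosine(u, v):
    """Cosine similarity between two sparse vectors given as {term: weight}."""
    dot = sum(weight * v.get(term, 0.0) for term, weight in u.items())
    norm_u = sqrt(sum(x * x for x in u.values()))
    norm_v = sqrt(sum(x * x for x in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

# Hypothetical term weights for three short documents.
doc_astronaut = {"astronaut": 1.0, "rocket": 1.0}
doc_cosmonaut = {"cosmonaut": 1.0, "rocket": 1.0}
doc_synonym_only = {"cosmonaut": 1.0}

sim_shared = cosine(doc_astronaut, doc_cosmonaut)       # share the term "rocket"
sim_disjoint = cosine(doc_astronaut, doc_synonym_only)  # no term in common
print(sim_shared, sim_disjoint)
```

The second pair scores zero even though "astronaut" and "cosmonaut" are near-synonyms, because each term is treated as its own independent dimension.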


Semantic Augmentation of the Vector Space Model

Several attempts to improve document retrieval efficiency by incorporating lexical semantic information:

— Voorhees (1994, 1998)
— Moldovan and Mihalcea (2000)
— Buscaldi et al. (2005)

No, or small, improvements to IR; some improvement for document classification.


Latent Semantic Analysis (LSA) / Indexing (LSI)

— Discrete entities are mapped onto a continuous vector space;
— the mapping is determined by global correlation patterns; and
— dimensionality reduction is an integral part of the process

(Landauer and Dumais, 1997; Ando, 2000; Bellegarda, 2007)


Dimensionality Reduction

— Singular Value Decomposition

[Figure: reduced space with the axes Rocket and {0.65 Cosmonaut, 0.35 Astronaut}]

Quantitative evaluation of different semantic word space models: Van de Cruys (2010)
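
The effect of the reduction can be sketched on a toy term-document matrix. To stay dependency-free, this sketch does not call a library SVD routine; it finds the two leading left singular vectors of A as the top eigenvectors of A Aᵀ, using power iteration with deflation. The matrix, the choice of k = 2, and all function names are assumptions made for this example.

```python
from math import sqrt

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(m, v):
    return [dot(row, v) for row in m]

def top_eigenvectors(sym, k, iterations=500):
    """Top-k eigenvectors of a symmetric PSD matrix (power iteration + deflation).

    Assumes the matrix has at least k nonzero, well separated eigenvalues.
    """
    m = [row[:] for row in sym]
    vectors = []
    for _ in range(k):
        v = [1.0] * len(m)
        for _step in range(iterations):
            w = matvec(m, v)
            norm = sqrt(dot(w, w))
            v = [x / norm for x in w]
        eigenvalue = dot(v, matvec(m, v))
        vectors.append(v)
        # Deflate: subtract the component just found so the next pass finds the next one.
        m = [[m[i][j] - eigenvalue * v[i] * v[j] for j in range(len(m))]
             for i in range(len(m))]
    return vectors

def cosine(u, v):
    nu, nv = sqrt(dot(u, u)), sqrt(dot(v, v))
    return dot(u, v) / (nu * nv) if nu and nv else 0.0

# Hypothetical term-document count matrix: rows are terms, columns are documents.
A = [
    [1, 0, 1, 0, 0, 0],  # astronaut
    [0, 1, 0, 1, 0, 0],  # cosmonaut
    [1, 1, 0, 0, 0, 0],  # rocket
    [0, 0, 0, 0, 1, 1],  # dictionary
    [0, 0, 0, 0, 1, 1],  # word
]
docs = [list(col) for col in zip(*A)]           # one term vector per document
gram = [[dot(r1, r2) for r2 in A] for r1 in A]  # A A^T, a term-by-term matrix
basis = top_eigenvectors(gram, k=2)             # leading left singular vectors of A
reduced = [[dot(b, d) for b in basis] for d in docs]

original_sim = cosine(docs[2], docs[3])         # astronaut-only vs cosmonaut-only
reduced_sim = cosine(reduced[2], reduced[3])    # same pair in the reduced space
unrelated_sim = cosine(reduced[2], reduced[4])  # astronaut-only vs dictionary doc
print(original_sim, reduced_sim, unrelated_sim)
```

In the original space the astronaut-only and cosmonaut-only documents share no term and have similarity 0; in the two-dimensional space they coincide, because both terms co-occur with "rocket", while the dictionary document stays orthogonal to them. In practice a linear-algebra library would be used for the decomposition.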


Ontology Matching

— Lacher and Groh (2001) used signature tf-idf vectors for computing similarity between two ontology nodes.


Summary

— Lexical semantics
— How this relates to my PhD research
— Examples of statistics-based approaches to Lexical Semantics, including:
  • different Word Sense Disambiguation techniques
  • semantic augmentation of the vector space model
  • how LSA/dimensionality reduction of vector spaces handles synonymy
  • how statistics-based similarity measures are used to align and merge ontologies


References I

Ando, Rie Kubota. 2000. Latent semantic space: Iterative scaling improves precision of inter-document similarity measurement. In SIGIR’00.

Bellegarda, Jerome R. 2007. Latent Semantic Mapping: Principles & Applications, vol. 3 of Synthesis Lectures on Speech and Audio Processing. Morgan & Claypool Publishers.

Buscaldi, D., P. Rosso, and E.S. Arnal. 2005. A WordNet-based query expansion method for geographical information retrieval. In Working Notes for the CLEF Workshop.


References II

Van de Cruys, Tim. 2010. A quantitative evaluation of semantic word space models. In Computational Linguistics In The Netherlands (CLIN) 20. Utrecht, Netherlands.

Fellbaum, Christiane, ed. 1998. WordNet: An electronic lexical database. Language, Speech, and Communication, Cambridge, Massachusetts, USA: The MIT Press.

Firth, John Rupert. 1957. Papers in linguistics 1934–1951. Oxford, UK: Oxford University Press.

Harris, Zellig Sabbettai. 1968. Mathematical structures of language. Krieger Publishing Company.


References III

Lacher, Martin S., and Georg Groh. 2001. Facilitating the exchange of explicit knowledge through ontology mappings. In Proceedings of the fourteenth international florida artificial intelligence research society conference, 305–309. AAAI Press.

Landauer, Thomas K., and Susan T. Dumais. 1997. A solution to Plato’s problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review (104):211–240.

Manning, Christopher D., and Hinrich Schütze. 1999. Foundations of statistical natural language processing. Cambridge, Massachusetts, USA: The MIT Press.


References IV

Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross, and Katherine J. Miller. 1990. Introduction to WordNet: an on-line lexical database. International Journal of Lexicography 3(4):235–244. (Revised August 1993).

Moldovan, Dan I., and Rada Mihalcea. 2000. Using WordNet and lexical operators to improve Internet searches. Internet Computing, IEEE 4:34–43.

Pustejovsky, James. 1998. The generative lexicon. Cambridge, Massachusetts, USA: The MIT Press.

Rubenstein, Herbert, and John B. Goodenough. 1965. Contextual correlates of synonymy. Commun. ACM 8(10):627–633.


References V

Salton, Gerard, ed. 1971. The smart retrieval system: Experiments in automatic document processing. Englewood Cliffs, NJ: Prentice-Hall.

Voorhees, Ellen M. 1994. Query expansion using lexical-semantic relations. In SIGIR’94: Proceedings of the 17th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 61–69.

Voorhees, Ellen M. 1998. Using WordNet for text retrieval. In Fellbaum (1998), chap. 12, 285–304.

Wilks, Yorick, Louise Guthrie, and Brian M. Slator. 1996. Electric words: Dictionaries, computers, and meanings. Cambridge, Massachusetts, USA: The MIT Press.
