147
Measuring the Similarity and Relatedness of Concepts : a MICAI 2013 Tutorial Ted Pedersen, Ph.D. University of Minnesota Department of Computer Science, Duluth http://www.d.umn.edu/~tpederse [email protected]

MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

Embed Size (px)

DESCRIPTION

These are the slides used in my tutorial at MICAI 2013 (presented November 25, 2013). The tutorial is on Measuring the Similarity and Relatedness of Concepts, and focuses on methods that rely on information from ontologies, possibly augmented with corpus statistics. This tutorial includes background on the measures, plus information on software that implements different measures, plus reference standard data sets that can be used to evaluate measures.

Citation preview

Page 1: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

Measuring the Similarity and Relatedness of Concepts :

a MICAI 2013 Tutorial

Ted Pedersen, Ph.D.

University of MinnesotaDepartment of Computer Science, Duluth

http://www.d.umn.edu/[email protected]

Page 2: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 2

What (I hope) you will learn!● The distinction between semantic similarity

and relatedness (and why both are useful)● How to measure using information from

ontologies, definitions, and corpora● How to use freely available software ● How to conduct experiments using freely

available reference standards● Some applications where these measures are

used or could be useful

Page 3: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 3

Orientation

● We focus on methods that measure similarity and relatedness using information found in an ontology, which may be possibly augmented with statistics from corpora or other resources

● We will not discuss purely distributional methods

– Very interesting and useful, and deserve their own separate tutorial

Page 4: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 4

Just a few distributional methods

● Latent Semantic Analysis – http://lsa.colorado.edu

● SenseClusters– http://senseclusters.sourceforge.net

● Clustering by Committee– http://demo.patrickpantel.com

● Disco– http://www.linguatools.de/disco/disco_en.html

Page 5: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 5

Outline● Measures of Similarity and Relatedness

– 75 minutes + 10 minutes of questions

● Using Open Source Software– 45 minutes + 10 minutes of questions

● Similarity and Relatedness in the Wild– 60 minutes + 10 minutes of questions

Page 6: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 6

Measures of Semantic Similarity and Relatedness

(without tears)

Page 7: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 7

What are we measuring?● Concept pairs (word senses)

– Assign a numeric value that quantifies how similar or related two concepts are

● Not words– Cold may be temperature or illness

● Word Sense Disambiguation

– This tutorial assumes senses assigned ● But, can also use these measures for WSD!

Page 8: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 8

Why?

● Being able to organize concepts by their similarity or relatedness to each other is a fundamental operation in the human mind, and in many problems in Natural Language Processing and Artificial Intelligence

● If we know a lot about X, and if we know Y is similar to X, then a lot of what we know about X may apply to Y

– Use X to explain or categorize Y

Page 9: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 9

What is lefse?

Page 10: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 10

Well, it's like a tortilla, except made with potatoes.

Page 11: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 11

Lefse is a traditional soft, Norwegian flatbread. Lefse is made out of flour, and milk or cream (or

sometimes lard), and cooked on a griddle. Traditional lefse does not include potato, but it is commonly added to make a thicker dough that is easier to work with. Special tools are available for lefse baking, including long wooden turning sticks

and special rolling pins with deep grooves.

Page 12: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 12

Lefse

Page 13: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 13

Similar or Related?

● Similarity based on is-a relations– How much is X like Y?

– Share ancestor in is-a hierarchy

Page 14: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 14

Is-a hierarchy

Page 15: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 15

Similar or Related?

● Similarity based on is-a relations– Share ancestor in is-a hierarchy

● LCS : least common subsumer● Closer / deeper the ancestor the more similar

– A miter saw and a sander are similar● both are kinds-of power tools (LCS)

Page 16: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 16

Least Common Subsumer (LCS)

Page 17: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 17

Saws!

Page 18: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 18

Sanders!

Page 19: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 19

Similar or Related?● Relatedness more general

– How much is X related to Y?

– Many ways to be related● is-a, part-of, treats, affects, symptom-of, ...

● Hammer and nail are related but they really aren't similar

– (use hammer to drive nails)

● All similar concepts are related, but not all related concepts are similar

Page 20: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 20

“Standard” Measures of Similarity● Path Based

– Rada et al., 1989 (path)

● Path + Depth – Wu & Palmer, 1994 (wup)

– Leacock & Chodorow, 1998 (lch)

● Path + Information Content– Resnik, 1995 (res)

– Jiang & Conrath, 1997 (jcn)

– Lin, 1998 (lin)

Page 21: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 21

Path Based Measures● Distance between concepts (nodes) in tree

intuitively appealing● Spatial orientation, good for networks or maps

but not is-a hierarchies– Reasonable approximation sometimes

– Assumes all paths have same “weight”

– But, more specific (deeper) paths tend to travel less semantic distance

● Shortest path a good start, needs corrections

Page 22: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 22

Shortest is-a Path

1● path(a,b) = ------------------------------

shortest is-a path(a,b)

Page 23: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 23

We count nodes...

● Maximum = 1 – self similarity

– path(miter saw, miter saw) = 1

● Minimum = 1 / (longest path in isa tree)– path(miter saw, claw hammer) = 1/7

– path(table saw, flat tip screwdriver) = 1/7

– etc...

Page 24: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 24

path(miter saw, sander) = .25

Page 25: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 25

miter saw & sander

Page 26: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 26

path (hammer, power tool) = .25

Page 27: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 27

hammers and power tools

Page 28: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 28

?

● Are hammer and power tool similar to the same degree as are mitre saw and sander?

● The path measure reports “yes, they are.”

Page 29: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 29

Path + Depth

● Path only doesn't account for specificity● Deeper concepts more specific● Paths between deeper concepts travel less

semantic distance

Page 30: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 30

Wu and Palmer, 1994

2 * depth (LCS (a,b))● wup(a,b) = ----------------------------

depth (a) + depth (b)

● depth(x) = shortest is-a path(root,x)

Page 31: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 31

wup(miter saw, sander) = (2*2)/(4+3) = .57

Page 32: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 32

wup (hammer, power tool) = (2*1)/(2+3) = .4

Page 33: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 33

?● Wu and Palmer reports that sander and miter

saw (.57) are more similar than are power tool and hammer (.4)

● Path reports that sander and miter saw (.25) are equally similar as are power tool and hammer (.25)

● Note that measures are scaled differently and so should compare relative rankings between measures (and not exact scores)

Page 34: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 34

Information Content

● ic(concept) = -log p(concept) [Resnik 1995]– Need to count concepts

– Term frequency +Inherited frequency

– p(concept) = tf + if / N

● Depth shows specificity but not frequency● Low frequency concepts often much more

specific than high frequency ones

Page 35: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 35

Information Content term frequency (tf)

Page 36: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 36

Information Content inherited frequency (if)

Page 37: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 37

Information Content (IC = -log (f/N) final count (f = tf + if, N = 365,820)

Page 38: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 38

Information Content (IC = -log (f/N) final count (f = tf + if, N = 365,820)

Page 39: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 39

Lin, 1998

2 * IC (LCS (a,b))● lin(a,b) = --------------------------

IC (a) + IC (b)

● Look familiar?

Page 40: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 40

Wu & Palmer, 1994

2 * depth (LCS (a,b))● wup(a,b) = --------------------------

depth (a) + depth (b)

● wup and lin are identical except that lin uses info content instead of depth – Info content provides a measure of

depth (based on specificity)

Page 41: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 41

lin (miter saw, sander) = 2 * 2.26 / (3.60 + 3.66) = 0.62

Page 42: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 42

lin (hammer, power tool) = 2 * 0.71 / (2.26+2.81) = 0.28

Page 43: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 43

?

● Lin : miter saw and sander (.62) more similar than hammer and power tool (.28)

● Wu and Palmer : miter saw and sander (.57) more similar than hammer and power tool (.4)

● Path miter saw and sander (.25) equally similar to hammer and power tool (.25)

Page 44: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 44

What about concepts not connected via is-a relations?

● Connected via other relations?– Part-of, treatment-of, causes, etc.

● Not connected at all?– In different sections (axes) of an ontology

– In different ontologies entirely

● Relatedness!– Use definition information

– No is-a relations so can't be similarity

Page 45: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 45

Measures of relatedness● Path based

– Hirst & St-Onge, 1998 (hso)

● Definition based– Lesk, 1986

– Adapted lesk (lesk)● Banerjee & Pedersen, 2003

● Definition + corpus– Gloss Vector (vector, vector_pairs)

● Patwardhan & Pedersen, 2006

Page 46: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 46

Path based relatedness

● Ontologies include relations other than is-a● These can be used to find shortest paths

between concepts– However, a path made up of different kinds

of relations can lead to big semantic jumps

– A hammer is used to drive nails which are made of iron which comes from mines in Minnesota

● …. so hammer and Minnesota are related ??

Page 47: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 47

Measuring relatedness with definitions

● Related concepts defined using many of the same terms

● But, definitions are short, inconsistent● Concepts don't need to be connected via

relations or paths to measure them– Lesk, 1986

– Adapted Lesk, Banerjee & Pedersen, 2003

Page 48: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 48

Two separate ontologies...

Page 49: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 49

Could join them together … ?

Page 50: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 50

Each concept has definition

Page 51: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 51

Each concept has definition

Page 52: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 52

Each concept has definition

Page 53: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 53

Overlaps● Claw hammer and carpenter

– Related by working with wood● Can't see this in structure of is-a hierarchies● Claw hammer and iron worker just as similar

● Ball peen hammer and claw hammer– Reflects structure of is-a hierarchies

– If you start with text like this maybe you can build is-a hierarchies automatically!

● Another tutorial...

Page 54: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 54

Lesk and Adapted Lesk● Lesk, 1986 : measure overlaps in definitions to

assign senses to words– The more overlaps between two senses

(concepts), the more related

● Banerjee & Pedersen, 2003, Adapted Lesk– Augment definition of each concept with

definitions of related concepts● Build a super gloss

– Increase chance of finding overlaps

Page 55: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 55

The problem with definitions ...● Definitions contain variations of terminology that

make it impossible to find exact overlaps● spatula : an instrument for spreading material● spreader : a hand tool for smoothing compounds● No matches??! How can we see that “hand tool”

and “instrument” are similar, as are “spreading material” and “smoothing compound” ?

Page 56: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 56

Gloss Vector Measure of Semantic Relatedness

● Rely on co-occurrences of terms – Terms that occur within some given number

of terms of each other other

● Allows for a fuzzier notion of matching● Exploits second order co-occurrences

– Friend of a friend relation

Page 57: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 57

Gloss Vector Measure of Semantic Relatedness

● Friend of a friend relation– Suppose hand tool and instrument don't

occur in text with each other. But, suppose that “repair” occurs with each.

– Hand tool and instrument are second order co-occurrences via “repair”

Page 58: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 58

Gloss Vector Measure of Semantic Relatedness

● Replace words or terms in definitions with vector of co-occurrences observed in corpus

● Defined concept now represented by an averaged vector of co-occurrences

● Measure relatedness of concepts via cosine between their respective vectors

● Patwardhan and Pedersen, 2006– Inspired by Schutze, 1998

Page 59: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 59

Experimental Results● Vector > Lesk > Info Content > Depth > Path

– Clear trend across various studies

● Big differences in intrinsic evaluations (Vector > Lesk >> Info Content > Depth > Path)

– Banerjee and Pedersen, 2003 (IJCAI)

– Pedersen, et al. 2007 (JBI)

● Smaller differences in extrinsic evaluations– Human raters mix up similarity &

relatedness?

Page 60: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 60

Questions?

Page 61: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 61

References● S. Banerjee and T. Pedersen. Extended gloss overlaps as a

measure of semantic relatedness. In Proceedings of the Eighteenth International Joint Conference on Artificial Intelligence, pages 805-810, Acapulco, August 2003. (lesk)

● J. Jiang and D. Conrath. Semantic similarity based on corpus statistics and lexical taxonomy. In Proceedings on International Conference on Research in Computational Linguistics, pages 19-33, Taiwan, 1997. (jcn)

● C. Leacock and M. Chodorow. Combining local context and WordNet similarity for word sense identification. In C. Fellbaum, editor, WordNet: An electronic lexical database, pages 265-283. MIT Press, 1998. (lch)

Page 62: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 62

References● M.E. Lesk. Automatic sense disambiguation using machine

readable dictionaries: how to tell a pine code from an ice cream cone. In Proceedings of the 5th annual international conference on Systems documentation, pages 24-26. ACM Press, 1986.

● D. Lin. An information-theoretic definition of similarity. In Proceedings of the International Conference on Machine Learning, Madison, August 1998. (lin).

● S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006. (vector, vector_pairs)

Page 63: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 63

References● R. Rada, H. Mili, E. Bicknell, and M. Blettner. Development and

application of a metric on semantic nets. IEEE Transactions on Systems, Man and Cybernetics, 19(1):17-30, 1989. (path)

● P. Resnik. Using information content to evaluate semantic similarity in a taxonomy. In Proceedings of the 14th International Joint Conference on Artificial Intelligence, pages 448-453, Montreal, August 1995. (res)

● H. Schütze. Automatic word sense discrimination. Computational Linguistics, 24(1):97-123, 1998.

Page 64: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 64

Using Open Source Software

Page 65: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 65

Using Open Source Software

● Packages providing the “standard” measures● Implementations of specific measures ● Overview of WordNet::Similarity usage

Page 66: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 66

Software that provides some of the “standard” measures

Page 67: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 67

WordNet::Similarity

● Similarity and Relatedness for WordNet– http://wordnet.princeton.edu

● Written in Perl (starting in 2002)● Offers command line, web interface, and API

– http://wn-similarity.sourceforge.net

● We'll come back to this for some examples

Page 68: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 68

ws4j

● Java Re-implementation of WordNet::Similarity– https://code.google.com/p/ws4j/

● Includes path, depth, info content, hso, and lesk measures

● Online demo– http://ws4jdemo.appspot.com/

Page 69: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 69

NLTK

● Natural Language Toolkit – Includes path, depth, and information

content measures

● Written in Python– General purpose NLP toolkit

● Parsers, part of speech taggers, and more

– http://nltk.org/

Page 70: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 70

DKPro Similarity● Semantic similarity using vector space models

like LSA and ESA, and also WordNet – Implemented using UIMA

– https://code.google.com/p/dkpro-similarity-asl/

● Part of the much larger DKPro project, which provides UIMA wrappers for many existing tools and models

● Supports measuring similarity of short texts and concept pairs

Page 71: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 71

Semilar

● Semantic similarity using WordNet and LSA– http://semanticsimilarity.org

● Supports measuring similarity of short texts and concept pairs

● Provides many pre-built models using LSA● Includes a web service and API in addition to

downloadable libraries

Page 72: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 72

UMLS::Similarity

● Ports WordNet::Similarity to the UMLS – Unified Medical Language System from

NLM, a data warehouse of medical sources● Freely available, license required

– http://umls-similarity.sourceforge.net

● Perl and mySQL

Page 73: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 73

ProteInOn

● Computes Semantic Similarity for the Gene Ontology (GO) using path and information content measures

– http://geneontology.org/

● Protein Interactions and Ontology– http://lasige.di.fc.ul.pt/webtools/proteinon/

Page 74: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 74

Measure Specific Software

Page 75: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 75

UKB

● Graph based similarity and relatedness measures, using WordNet

– http://ixa2.si.ehu.es/ukb/

● Applies Personalized Page Rank to semantic similarity and relatedness measures, as well as to word sense disambiguation

Page 76: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 76

WMFVEC

● High dimensional vector approach using definitions from WordNet and Wiktionary

– http://www.cs.columbia.edu/~weiwei/code.html#wmfvec

● Supports similarity measurements of short texts and concept pairs

Page 77: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 77

olesk

● Shortest path in weighted semantic network– http://olesk.com/#SemanticRelatedness

● Supports measuring similarity of short texts and concept pairs

Page 78: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 78

Illinois WNSim

● WordNet-based Similarity Metric– https://cogcomp.cs.illinois.edu/page/softwa

re_view/Illinois%20WNSim– Also provides Java version

– https://cogcomp.cs.illinois.edu/page/software_view/Illinois%20WNSim%20(Java)

● Measures similarity of short texts, provides support for similarity of named entities

Page 79: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 79

WordNet::Similarity Usage

Page 80: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 80

WordNet::Similarity● Install WordNet (http://princeton.wordnet.edu)

– Make sure to set $WNHOME

● Install WordNet::QueryData– cpan

– > install WordNet::QueryData

● Install WordNet::Similarity– cpan

– > install WordNet::Similarity

Page 81: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 81

WordNet::Similarity

Page 82: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 82

Command line

● similarity.pl --type WordNet::Similarity::path dog cat

– dog#n#1 cat#n#1 0.2

● similarity.pl --type WordNet::Similarity::path dog#n cat#n

– dog#n#1 cat#n#1 0.2

● similarity.pl --type WordNet::Similarity::path dog#n#2 cat#n#3

– dog#n#2 cat#n#3 0.125

● Similarity.pl –type WordNet::Similarity::path dog cat#n#3

– dog#n#3 cat#n#3 0.142857142857143

Page 83: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 83

Command line● similarity.pl --type WordNet::Similarity::path dog cat --allsenses

– dog#n#1 cat#n#1 0.2

– dog#n#3 cat#n#2 0.2

– dog#n#1 cat#n#7 0.2

– dog#n#7 cat#n#5 0.166666666666667

– dog#n#4 cat#n#2 0.142857142857143

– dog#n#3 cat#n#3 0.142857142857143

– dog#v#1 cat#v#2 0.142857142857143

– dog#n#4 cat#n#3 0.142857142857143

– dog#n#6 cat#n#5 0.142857142857143

– dog#v#1 cat#v#1 0.142857142857143

– dog#n#2 cat#n#2 0.125

– Etc...

Page 84: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 84

WordNet senses● wn cat -over

● Overview of noun cat

● The noun cat has 8 senses (first 1 from tagged texts)

● 1. (18) cat, true cat -- (feline mammal usually having thick soft fur and no ability to roar: domestic cats; wildcats)

● 2. guy, cat, hombre, bozo -- (an informal term for a youth or man; "a nice guy"; "the guy's only doing it for some doll")

● 3. cat -- (a spiteful woman gossip; "what a cat she is!")

● 4. kat, khat, qat, quat, cat, Arabian tea, African tea -- (the leaves of the shrub Catha edulis which are chewed like tobacco or used to make tea; has the effect of a euphoric stimulant; "in Yemen kat is used daily by 85% of adults")

Page 85: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 85

wn cat -over● 5. cat-o'-nine-tails, cat -- (a whip with nine knotted cords;

"British sailors feared the cat")

● 6. Caterpillar, cat -- (a large tracked vehicle that is propelled by two endless metal belts; frequently used for moving earth in construction and farm work)

● 7. big cat, cat -- (any of several large cats typically able to roar and living in the wild)

● 8. computerized tomography, computed tomography, CT, computerized axial tomography, computed axial tomography, CAT -- (a method of examining body organs by scanning them with X rays and using a computer to construct a series of

cross-sectional scans along a single axis

Page 86: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 86

wn cat -over

● Overview of verb cat

● The verb cat has 2 senses (no senses from tagged texts)

● 1. cat -- (beat with a cat-o'-nine-tails)

● 2. vomit, vomit up, purge, cast, sick, cat, be sick, disgorge, regorge, retch, puke, barf, spew, spue, chuck, upchuck, honk, regurgitate, throw up -- (eject the contents of the stomach through the mouth; "After drinking too much, the students vomited"; "He purged continuously"; "The patient regurgitated the food we gave him last night")

Page 87: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 87

Measure type(-type WordNet::Similarity::X)

● Similarity– Shortest Path

● path

– Depth ● wup, lch

– Information Content ● res, lin, jcn

Page 88: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 88

Measure type(-type WordNet::Similarity::X)

● Relatedness– Definition

● lesk

– Definition + Corpus● vector● vector_pairs

– Path finding ● hso

Page 89: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 89

Measure type(-type WordNet::Similarity::X)

● Random– rand

– Very useful for experimental comparisons● How do your results compare with a random

measure of similarity or relatedness??

Page 90: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 90

Similarity measures don't cross part of speech tags

● similarity.pl --type WordNet::Similarity::path dog#n cat#v

– Warning (WordNet::Similarity::path::parseWps()) - dog#n and cat#v belong to different parts of speech.

– dog#n#2 cat#v#1 -1000000

Page 91: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 91

Relatedness measures cross POS boundaries

● similarity.pl --type WordNet::Similarity::lesk dog#n#1 cat#v#1

– dog#n#1 cat#v#1 20

● similarity.pl --type WordNet::Similarity::vector dog#n#1 cat#v#1

– dog#n#1 cat#v#1 0.146581307649965

Page 92: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 92

API● use WordNet::Similarity::wup;

● use WordNet::QueryData;

● my $wn = WordNet::QueryData->new();

● my $wup = WordNet::Similarity::wup->new($wn);

● my $value = $wup->getRelatedness('dog#n#1', 'cat#n#1');

● my ($error, $errorString) = $wup->getError();

● die $errorString if $error;

● print "dog (sense 1) <-> cat (sense 1) = $value\n";

● dog (sense 1) <-> cat (sense 1) = 0.866666666666667

Page 93: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 93

API● my $wn = WordNet::QueryData->new;

● use WordNet::Similarity::PathFinder;

● my $obj = WordNet::Similarity::PathFinder->new ($wn);

● my $wps1 = 'winston_churchill#n#1';

● my $wps2 = 'england#n#1';

● my @paths = $obj->getShortestPath($wps1, $wps2, 'n', 'wps');

● my ($length, $path) = @{shift @paths};

● defined $path or die "No path between synsets";

● print "shortest path between $wps1 and $wps2 is $length edges long\n";

● print "@$path\n";

● shortest path between winston_churchill#n#1 and england#n#1 is 14 edges long winston_churchill#n#1 writer#n#1 communicator#n#1 person#n#1 causal_agent#n#1 physical_entity#n#1 object#n#1 location#n#1 region#n#3 district#n#1 administrative_district#n#1 country#n#2 European_country#n#1 england#n#1

Page 94: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 94

Web Interface

Page 95: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 95

Web Interface

Page 96: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 96

Web Interface

Page 97: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 97

Web Interface

Page 98: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 98

Web Interface

Page 99: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 99

Web Interface

Page 100: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 100

Web Interface● If you like the web interface, you can run your

own version!– similarity_server.pl

– All necessary html and cgi files included

Page 101: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 101

Other Utilities

● Build new information content files – by default counts come from SemCor

– BNCFreq.pl

– brownFreq.pl

– treebankFreq.pl

– rawtextFreq.pl

● compounds.pl – list all WordNet compounds● wnDepths.pl – list all WordNet depths

Page 102: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 102

Questions?

Page 103: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 103

References● Eneko Agirre, Enrique Alfonseca, Keith Hall, Jana Kravalova, Marius

Pasca and Aitor Soroa. 2009. A Study on Similarity and Relatedness Using Distributional and WordNet-based Approaches. Proceedings of NAACL-HLT 09. Boulder, USA. (ukb)

● Daniel Bär, Torsten Zesch, and Iryna Gurevych. DKPro Similarity: An Open Source Framework for Text Similarity, in Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics: System Demonstrations, pages 121-126, August 2013, Sofia, Bulgaria. (pdf) (bib) (dkpro-similarity)

● Steven Bird, Ewan Klein, and Edward Loper (2009). Natural Language Processing with Python. O’Reilly Media Inc. (nltk)

● Q. Do and D. Roth and M. Sammons and Y. Tu and V. Vydiswaran, Robust, Light-weight Approaches to compute Lexical Similarity. Computer Science Research and Technical Reports, University of Illinois (2009) (Illionois WNSim)

Page 104: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 104

References● Weiwei Guo and Mona Diab. "Improving Lexical Semantics for Sentential

Semantics: Modeling Selectional Preference and Similar Words in a Latent Variable Model". In Proceedings of NAACL, 2013, Atlanta, Georgia, USA. (wmfvec)

● Bridget McInnes, Ted Pedersen, and Serguei Pakhomov, UMLS-Interface and UMLS-Similarity : Open Source Software for Measuring Paths and Semantic Similarity - Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, Nov 14-18, 2009, pp. 431-435, San Francisco, CA (umls-similarity)

● Ted Pedersen, Siddharth Patwardhan and Jason Michelizzi, WordNet::Similarity - Measuring the Relatedness of Concepts - Appears in the Proceedings of Fifth Annual Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL-04), pp. 38-41, May 3-5, 2004, Boston, MA. (wordnet-similarity)

Page 105: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 105

References● Rus, V., Lintean, M., Banjade, R., Niraula, N., and Stefanescu, D.

(2013). SEMILAR: The Semantic Similarity Toolkit. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, Sofia, Bulgaria. (semilar)

● Reda Siblini and Leila Kosseim (2013). Using a Weighted Semantic Network for Lexical Semantic Relatedness. In Proceedings of Recent Advances in Natural Language Processing (RANLP 2013), September, Hissar, Bulgaria. (olesk)

Page 106: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 106

Similarity and Relatedness in the Wild :How do we know it's working?

Page 107: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 107

Intrinsic Evaluation● Develop your own measure● Score it using pairs for which human reference

standard is available● Compare correlation between your measure

and established measures – Spearman's rank correlation often used

– rank.pl in Ngram Statistics Package● http://ngram.sourceforge.net

Page 108: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 108

Intrinsic Evaluation● Replication proves to be very difficult!● Many factors, see ACL 2013 paper

– Offspring from Reproduction Problems: What Replication Failure Teaches Us (Fokkens, van Erp, Postma, Pedersen, Vossen, and Freire) - Appears in the Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics, August 4-9, 2013, pp. 1691-1701, Sofia, Bulgaria.

– http://aclweb.org/anthology//P/P13/P13-1166.pdf

Page 109: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 109

Reference Standards● Rubenstein and Goodenough, 1965

– 65 pairs

– Assessed by 50 undergraduate students

– http://www.d.umn.edu/~tpederse/Data/rubenstein-goodenough-1965.txt

● Miller and Charles, 1991– 30 pair subset of R&G

– Re-assessed by 38 undergraduate students

– http://www.d.umn.edu/~tpederse/Data/miller-charles-1991.txt

Page 110: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 110

Rubenstein and Goodenough pairsno similarity (0.0) – synonyms (4.0)

● 3.94 gem jewel● 3.92 automobile car● 2.41 brother lad● 2.37 crane implement● 0.04 rooster voyage● 0.04 noon string

– Rubenstein & Goodenough

● 3.92 car automobile● 3.84 gem jewel● 1.66 lad brother● 1.68 crane implement● 0.08 rooster voyage● 0.08 noon string

– Miller & Charles

Page 111: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 111

Reference Standards• WordSim-353, 2002

– 153 pairs assessed by 13 subjects

– 200 pairs assessed by 16 subjects

– Includes the Miller and Charles pairs (re-assessed)

● http://www.cs.technion.ac.il/~gabr/resources/data/wordsim353/

Page 112: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 112

Reference Standards

● Yang and Powers, 2006● 130 verb pairs

– Assessed by 2 academic staff and 4 graduate students

– How related in meaning is the pair? ● 0 for not at all ● 4 for inseperably related

– http://david.wardpowers.info/Research/AI/papers/200601-GWC-130verbpairs.txt

Page 113: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 113

Reference standards● Mturk 771, 2012● 771 word pairs scored for relatedness by

Mechanical Turkers● At least 20 judgements per pair

– 1 for not related, 5 for highly related

– 50 ratings per Turker

– http://www2.mta.ac.il/~gideon/mturk771.html

Page 114: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 114

Reference Standards● MWE-300, 2012

– 300 pairs where 216 are multi-word expressions and 84 are word pairs

– Assessed by 5 native speakers on scale of 0 to 1

– http://adapt.seiee.sjtu.edu.cn/similarity/

Page 115: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 115

Reference Standards● Rel-122, 2013● Relatedness scores for 122 noun pairs,

created at University of Central Florida– Each pair assessed by at least 20

undergraduate students

– 0 for completely unrelated, 4 for strongly related

– http://www.cs.ucf.edu/~seansz/rel-122/

Page 116: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 116

Reference standards● MayoSRS, 2007

– 101 pairs of medical concepts

– Assessed by 13 medical coders and 3 physicians, all from Mayo Clinic

● 1 for not at all related, 4 for nearly synonymous

– MiniMayoSRS – a highly reliable subset of 29 pairs

● http://rxinformatics.umn.edu/SemanticRelatednessResources.html

Page 117: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 117

Reference standardsUMNSRS, 2010

– 566 pairs of medical concepts assessed for similarity by 8 medical students / residents

– 587 pairs of medical concepts assessed for relatedness by 8 medical students / residents

● Assessed on a continuous scale (0 – 1500)● http://rxinformatics.umn.edu/SemanticRelated

nessResources.html

Page 118: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 118

Reference Standards● Lexical & Distributional Semantics Evaluation

Benchmarks, maintained by Manaal Faruqui

– http://www.cs.cmu.edu/~mfaruqui/suite.html● ACL Wiki (various datasets for related tasks)

– http://aclweb.org/aclwiki/index.php?title=Similarity_(State_of_the_art)http://aclweb.org/aclwiki/index.php?title=Knowledge_collections_and_datasets_(English)

● SemEval (many related tasks with data)

– http://aclweb.org/aclwiki/index.php?title=SemEval_Portal

Page 119: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 119

Extrinsic Evaluation

● Synonym Tests● Word Sense Disambiguation● Semantic Textual Similarity● Recognizing Textual Entailment

Page 120: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 120

ESL Synonym Tests

● Provide one target word in context● Select “closest” synonym from a list of 4● Used in previous versions of TOEFL and other

standardized tests

● http://aclweb.org/aclwiki/index.php?title=ESL_Synonym_Questions_(State_of_the_art)

● 50 question data set available from Peter Turney

– http://www.apperceptual.com/

Page 121: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 121

ESL Synonym Tests

● Stem: "A rusty nail is not as strong as a clean, new one."

● Choices: – (a) corroded

– (b) black

– (c) dirty

– (d) painted

Page 122: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 122

ESL Synonym Tests● similarity.pl --type WordNet::Similarity::vector --file rusty.txt

● rusty#a#1 dirty#a#1 0.175879883782967

● rusty#a#1 painted#a#1 0.0844532311114079

● rusty#a#1 black#a#2 0.0656019491836669

● rusty#a#1 corroded#a#1 0.0357324093083641

– :(

Page 123: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 123

TOEFL Synonym Tests

● Rusty and other words are adjectives● Must used relatedness measure

– lesk – vector– vector_pairs– hso

● Should do word sense disambiguation first

Page 124: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 124

Word Sense Disambiguation● The meanings of words that occur together in

a context will likely be related – If a word has multiple senses, it will most

likely be used in the sense that is most related to the senses of it's neighbors

– Relatedness seems to matter more than similarity, unless you have a list

● I have a horse, a cat and a cow at my farm.

Page 125: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 125

Word Sense Disambiguation

● SenseRelate Hypothesis : Most words in text will have multiple possible senses and will often be used with the sense most related to those of surrounding words

– He either has a cold or the flu● Cold not likely to mean air temperature

Page 126: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 126

SenseRelate● In coherent text words will be used in similar or

related senses, and these will also be related to the overall topic or mood of a text

● First applied to WSD in 2002– Banerjee and Pedersen, 2002 (WordNet)

– Patwardhan et al., 2003 (WordNet)

– Pedersen and Kolhatkar 2009 (WordNet)

– McInnes et al., 2011 (UMLS)

Page 127: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 127

Implementations● WordNet::SenseRelate

– AllWords, TargetWord, WordToSet

– http://senserelate.sourceforge.net● Includes command line, API, and web

interface

● UMLS::SenseRelate– AllWords

– http://search.cpan.org/dist/UMLS-SenseRelate/

Page 128: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 128

Web Interface

Page 129: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 129

Web Interface

Page 130: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 130

SenseRelate for WSD● Assign each word the sense which is most

similar or related to one or more of its neighbors

– Pairwise

– 2 or more neighbors

● Pairwise algorithm results in a trellis much like in HMMs

– More neighbors adds lots of information and a lot of computational complexity

Page 131: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 131

SenseRelate - pairwise

Page 132: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 132

SenseRelate – 2 neighbors

Page 133: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 133

General Observations on WSD Results

● Nouns more accurate; verbs, adjectives, and adverbs less so

● Increasing the window size nearly always improves performance

● Jiang-Conrath measure often a high performer for nouns (e.g., Patwardhan et al. 2003)

● Vector and lesk have coverage advantage– handle mixed pairs while others don't

Page 134: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 134

SenseRelate Sentiment Classification● The underlying sentiment of a text can be

discovered by determining which emotion is most related to the words in that text.

– Related to happy? : love, food, success, ...

– Similar to happy? : joyful, ecstatic, ...

– Pairwise comparisons between emotion and senses of words in context

● Same form as Naive Bayesian model– WordNet::SenseRelate::WordToSet

Page 135: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 135

SenseRelate - WordToSet

Page 136: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 136

Experimental Results● Sentiment classification results in 2011 i2b2

suicide notes challenge were disappointing (Pedersen, 2012)

– Suicide notes not very emotional!

– In many cases reflect a decision made and focus on settling affairs

Page 137: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 137

Semantic Textual Similarity (STS)● How similar (semantically) are 2 texts?

– The Senate Select Committee on Intelligence is preparing a blistering report on prewar intelligence on Iraq.

– American intelligence leading up to the war on Iraq will be criticized by a powerful US Congressional committee due to report soon, officials said today

● http://www-nlp.stanford.edu/wiki/STS

Page 138: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 138

Semantic Textual Similarity (STS)● Combined distributional and WordNet

information to learn a model from training data– UKP: Computing Semantic Textual Similarity by Combining

Multiple Content Similarity Measures,Daniel Bär, Chris Biemann, Iryna Gurevych, and Torsten Zesch, Semeval 2012

● LSA Boosted with WordNet– UMBC EBIQUITY-CORE: Semantic Textual Similarity Sy

stemsLushan Han, Abhay L. Kashyap, Tim Finin, James Mayfield, and Johnathan Weese, *Sem 2013

Page 139: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 139

Recognizing Textual Entailment (RTE)

● A text entails a hypothesis if a human reading the text would infer that the hypothesis is true

– Text : The Christian Science Monitor named a US journalist kidnapped in Iraq as freelancer Jill Carroll.

– Hypothesis: Jill Carroll was abducted in Iraq.

– Hypothesis: The Christian Science Monitor kidnapped a freelancer.

Page 140: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 140

RTE methods and data● Long series of shared tasks

– 2004 to present

– http://aclweb.org/aclwiki/index.php?title=Textual_Entailment_Resource_Pool

● Recognizing that T and H are similar is helpful, although does not really solve the problem

– Hybrid approaches (like with STS)

Page 141: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 141

Applications

● Semantic similarity and relatedness are important components of many NLP applications

– Crucial building blocks

– Interesting to study in their own right

Page 142: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 142

Thank you!

If you have any suggestions for content that should be added to or changed in this tutorial, please let me know! Any other comments are

welcome too.

[email protected]

Questions?

Page 143: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 143

References● S. Banerjee and T. Pedersen. An adapted Lesk algorithm for word sense

disambiguation using WordNet. In Proceedings of the Third International Conference on Intelligent Text Processing and Computational Linguistics, pages 136—145, Mexico City, February 2002. (wsd result)

● D. Faria, C. Pesquita, F. M. Couto, and A. Falcão, ProteInOn: A Web Tool for Protein Semantic Similarity, Technical Report, Department of Informatics, University of Lisbon, 2007 (proteinon)

● L. Finkelstein, E. Gabrilovich, Y. Matias, E. Rivlin, Z. Solan and G. Wolfman (2002). Placing Search in Context: The Concept Revisited. ACM Transactions on Information Systems, 20(1), 116-131. (wordsim-353)

● B. McInnes, T. Pedersen, Y. Liu, G. Melton and S. Pakhomov. Knowledge-based Method for Determining the Meaning of Ambiguous Biomedical Terms Using Information Content Measures of Similarity. Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, pages 895-904, Washington, DC, October 2011. (wsd result)

Page 144: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 144

References● G. A. Miller and W. G. Charles (1991). Contextual Correlates of Semantic

Similarity. Language and Cognitive Processes, 6(1), 1-28.

● S. Pakhomov, B. McInnes, T. Adam, Y. Liu, T. Pedersen, and G Melton, Semantic Similarity and Relatedness between Clinical Terms : An Experimental Study - Appears in the Proceedings of the Annual Symposium of the American Medical Informatics Association, November 13-17, 2010, pp. 572 - 576, Washington, DC. (umnsrs)

● S. Patwardhan, S. Banerjee, and T. Pedersen. Using measures of semantic relatedness for word sense disambiguation. In Proceedings of the Fourth International Conference on Intelligent Text Processing and Computational Linguistics, pages 241—257, Mexico City, February 2003. (wsd result)

● S. Patwardhan and T. Pedersen. Using WordNet-based Context Vectors to Estimate the Semantic Relatedness of Concepts. In Proceedings of the EACL 2006 Workshop on Making Sense of Sense: Bringing Computational Linguistics and Psycholinguistics Together, pages 1-8, Trento, Italy, April 2006. (wsd result)

Page 145: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 145

References● T. Pedersen and V. Kolhatkar. WordNet :: SenseRelate :: AllWords - a

broad coverage word sense tagger that maximizes semantic relatedness. In Proceedings of the North American Chapter of the Association for Computational Linguistics - Human Language Technologies 2009 Conference, pages 17-20, Boulder, CO, June 2009. (wsd result)

● T. Pedersen, S. Pakhomov, S. Patwardhan, and C. Chute. Measures of semantic similarity and relatedness in the biomedical domain. Journal of Biomedical Informatics, 40(3) : 288-299, June 2007. (mayosrs)

● T. Pedersen. Rule-based and lightly supervised methods to predict emotions in suicide notes. Biomedical Informatics Insights, 2012:5 (Suppl. 1):185-193, January 2012. (sentiment result)

● H. Rubenstein and J. B. Goodenough (1965). Contextual Correlates of Synonymy. Communications of the ACM, 8(10), 627-633.

Page 146: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 146

References● SemEval-2012 Task 6: A Pilot on Semantic Textual Similarity, Eneko

Agirre, Daniel Cer, Mona Diab, Aitor Gonzalez-Agirre, Semeval 2012 (sts shared task)

● S. Szumlanski, F. Gomez and V.K. Sims (2013). A New Set of Norms for Semantic Relatedness Measures. Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers) (pp. 890-895). Sofia, Bulgaria. (rel-122)

● P. D. Turney (2001). Mining the Web for synonyms: PMI-IR versus LSA on TOEFL. Proceedings of the Twelfth European Conference on Machine Learning (ECML-2001), Freiburg, Germany, pp. 491-502. (toefl synonyms)

● W. Wu, H. Li, H. Wang, and K. Q. Zhu. Probase: a probabilistic taxonomy for text understanding. In Proceedings of SIGMOD'12, pages 481-492, 2012. (mwe-300)

● D. Yang and D.M. W. Powers (2006). Verb Similarity on the Taxonomy of WordNet. Proceedings of the Third International WordNet Conference (GWC-06) (pp. 121-128). Jeju Island, Korea.

Page 147: MICAI 2013 Tutorial Slides - Measuring the Similarity and Relatedness of Concepts

November 25, 2013 MICAI-2013 Tutorial 147

The End!