Representing Meaning in Unsupervised Word Sense Disambiguation


1

Representing Meaning in Unsupervised Word Sense Disambiguation

Bridget T. McInnes

5 September 2008

University of Minnesota Twin Cities

2

What is WSD?

The culture count doubled.

Culture

Laboratory Culture

Anthropological Culture

Sense Inventory

3

Approaches to WSD

Supervised

Advantages: obtains a high accuracy

Disadvantages: manually annotated training data is required for each word that needs to be disambiguated, so it cannot scale

Unsupervised

Advantages: does not require manually annotated training data

Disadvantages: generally does not obtain as high an accuracy as supervised approaches

4

Unsupervised Approaches

Similarity and Relatedness Based

5

Unsupervised Approaches

Similarity and Relatedness Based

Patwardhan, Banerjee and Pedersen 2005

Pedersen, et al 2006

Budanitsky and Hirst 2006

6

Unsupervised Approaches

Similarity and Relatedness based

Vector Based

7

Unsupervised Approaches

Similarity and Relatedness Based

Vector-based

Mohammad and Hirst, 2006

Patwardhan, 2003

Pedersen, et al 2006

Humphrey, et al 2006

8

Unsupervised Approaches

Similarity and Relatedness-based

Vector-based

Clustering

9

Unsupervised Approaches

Similarity and Relatedness based

Vector-based

Clustering

Pedersen and Bruce, 1997

Schütze, 1998

Pedersen and Bruce, 1998

Purandare and Pedersen, 2004

Kulkarni and Pedersen, 2005

10

Road Map

Previous Approaches

Our vector approach

Future Work

11

Previous Approaches

Similarity and Relatedness Based

SenseRelate (Banerjee and Pedersen, 2003)

Vector-based

Semantic Type Indexing (Humphrey et al 2006)

Clustering

SenseClusters (Kulkarni and Pedersen, 2005)

12

Banerjee and Pedersen 2003

SenseRelate

13

SenseRelate

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.

glutathione S-linked conjugates.

C0017817 C0522529 C0301869

C0005528 = SS + SS + SS = Total SS for Concept 1, where each SS is the similarity score between the candidate concept and one concept from the context

14

SenseRelate

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.

glutathione S-linked conjugates.

C0017817 C0522529 C0301869

C0005528 = SS + SS + SS = Total SS for Concept 1

C0150390 = SS + SS + SS = Total SS for Concept 2

Each SS is the similarity score between the candidate concept and one concept from the context.
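The scoring step on this slide can be sketched as follows. This is a minimal illustration, assuming some pairwise concept similarity function is available; `pick_sense`, `toy_similarity`, and the numeric scores are hypothetical placeholders and are not taken from the talk or from the SenseRelate software.

```python
from typing import Callable, Dict, List, Tuple


def pick_sense(
    candidates: List[str],
    context: List[str],
    similarity: Callable[[str, str], float],
) -> Tuple[str, Dict[str, float]]:
    """Return the candidate with the largest summed similarity to the context."""
    totals = {
        cand: sum(similarity(cand, ctx) for ctx in context)
        for cand in candidates
    }
    best = max(totals, key=lambda cand: totals[cand])
    return best, totals


# Placeholder scores for illustration only; they are NOT from the talk.
TOY_SCORES = {
    ("C0005528", "C0017817"): 0.6,
    ("C0005528", "C0522529"): 0.3,
    ("C0005528", "C0301869"): 0.5,
    ("C0150390", "C0017817"): 0.1,
    ("C0150390", "C0522529"): 0.2,
    ("C0150390", "C0301869"): 0.1,
}


def toy_similarity(candidate: str, context_concept: str) -> float:
    """Look up a placeholder score; unknown pairs count as 0."""
    return TOY_SCORES.get((candidate, context_concept), 0.0)


best, totals = pick_sense(
    ["C0005528", "C0150390"],
    ["C0017817", "C0522529", "C0301869"],
    toy_similarity,
)
print(best, totals)
```

With these placeholder scores the total for Biological Transport is larger, so that concept would be selected.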

15

Humphrey et al, 2006

Semantic Type Indexing for WSD

16

Semantic Type Indexing (STI)

Target Word: Transport

Transport of glutathione S-linked conjugates.

Concept 1: Biological Transport

Semantic type: Cell Function

Concept 2: Patient Transport

Semantic type: Health Care Activity

JDI

Target Word Vector (TW): JDI vector

Concept 1 Vector (CV1): JDI vector

Concept 2 Vector (CV2): JDI vector

Cosine 1: Target Word Vector and Concept 1 Vector

Cosine 2: Target Word Vector and Concept 2 Vector

17

Target Word Vector

Transport of glutathione S-linked conjugates.

Contains the words surrounding the ambiguous word

18

STI - Target Word Vectors

Transport of glutathione S-linked conjugates.

Contains the words surrounding the ambiguous word

19

STI - Concept Vectors

The concept vectors are created based on their semantic type(s)

Transport:

C0005528: Biological Transport

C0150390: Patient Transport

C0005528: Cell Function

One-word terms in the Metathesaurus associated with Cell Function

C0150390: Health Care Activity

One-word terms in the Metathesaurus associated with Health Care Activity

20

Kulkarni and Pedersen, 2005

SenseClusters

21

SenseClusters (SC)

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Instance 1, Instance 2, Instance 3, Instance 4, Instance 5, Instance 6, Instance 7, Instance 8, Instance 9, Instance 10, Instance 11, Instance 12, Instance 13, …

Concept 1

Concept 2

Transport of glutathione S-linked conjugates.

22

SenseClusters (SC)

Instance 1, Instance 2, Instance 3, Instance 4, Instance 5, Instance 6, Instance 7, Instance 8, Instance 9, Instance 10, Instance 11, Instance 12, Instance 13, …

Concept 1

Concept 2

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Transport of glutathione S-linked conjugates.

23

SenseClusters

Concept 2 Vector

Concept 1 Vector

Target Word Vector

Cosine 2

Cosine 1

Target Word: Transport

Concept 1: Biological Transport

Concept 2: Patient Transport

Transport of glutathione S-linked conjugates.

24

SC - Vectors

Contain the words surrounding the ambiguous word

Created using:

First order co-occurrences

Second order co-occurrences

25

First Order Co-occurrence Vectors

                 Word 1   Word 2   ...   Word N
glutathione        50        6     ...      5
S-linked            5        6     ...      1
conjugates          5        0     ...     15
Target Vector      20        4     ...      7
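The target vector on this slide is consistent with a column-wise average of the three first-order vectors. A minimal sketch of that averaging follows, assuming the counts belong to glutathione, S-linked, and conjugates in that order; `average_vectors` is a hypothetical helper name, not a CuiTools or SenseClusters function.

```python
from typing import Dict, List


def average_vectors(vectors: List[List[float]]) -> List[float]:
    """Average a list of equal-length vectors dimension by dimension."""
    if not vectors:
        raise ValueError("need at least one vector")
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


# Counts as read from the slide (columns: Word 1, Word 2, Word N).
first_order: Dict[str, List[float]] = {
    "glutathione": [50, 6, 5],
    "S-linked": [5, 6, 1],
    "conjugates": [5, 0, 15],
}

target_vector = average_vectors(list(first_order.values()))
print(target_vector)  # [20.0, 4.0, 7.0]
```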

26

Second Order Co-occurrence Vectors

1st order glutathione:

           Word 1   Word 2   ...   Word N
             10       30     ...      0

Word-by-word co-occurrence matrix:

           Word 1   Word 2   ...   Word N
Word 1       20       10     ...      0
Word 2       10        0     ...      0
...
Word N        2       50     ...      2

2nd order glutathione: 0 2 2 …

27

Second Order Co-occurrence Vectors

                 Word 1   Word 2   ...   Word N
glutathione        10       30     ...      2
S-linked            0        6     ...      0
conjugates          5        0     ...     13
Target Vector       5       13     ...      5
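A rough sketch of how a second-order target vector can be built: each context word is replaced by its own co-occurrence row from a training corpus, and the rows are averaged. The toy corpus, the sentence-level co-occurrence counting, and the function names below are all assumptions made for illustration; they are not the SenseClusters implementation.

```python
from collections import Counter, defaultdict
from typing import Dict, List


def cooccurrence_rows(sentences: List[List[str]]) -> Dict[str, Counter]:
    """For every word, count how often each other word shares a sentence with it."""
    rows: Dict[str, Counter] = defaultdict(Counter)
    for sentence in sentences:
        for i, word in enumerate(sentence):
            for j, other in enumerate(sentence):
                if i != j:
                    rows[word][other] += 1
    return rows


def second_order_vector(
    context: List[str], rows: Dict[str, Counter], vocab: List[str]
) -> List[float]:
    """Average the co-occurrence rows of the context words over a fixed vocabulary."""
    known = [word for word in context if word in rows]
    if not known:
        return [0.0] * len(vocab)
    return [sum(rows[word][v] for word in known) / len(known) for v in vocab]


# Hypothetical three-sentence training corpus.
corpus = [
    ["glutathione", "conjugates", "liver"],
    ["glutathione", "enzyme", "liver"],
    ["conjugates", "bile", "liver"],
]
rows = cooccurrence_rows(corpus)
vector = second_order_vector(
    ["glutathione", "conjugates"], rows, ["bile", "enzyme", "liver"]
)
print(vector)  # [0.5, 0.5, 2.0]
```

The context words never need to co-occur with a vocabulary word directly in the test instance; the information comes from what they co-occur with in the training corpus.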

28

Our unsupervised approach

29

CuiTools Approach

Our approach uses a general vector approach with SenseClusters vectors

30

CuiTools

Concept 2 Vector

Concept 1 Vector

Target Word Vector

Cosine 2

Cosine 1

Target Word: Transport

Concept 1: Biological Transport (C0005528)

Concept 2: Patient Transport (C0150390)

Transport of glutathione S-linked conjugates.
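The comparison in this diagram can be sketched as a cosine between the target word vector and each concept vector, keeping the concept with the larger cosine. The vectors below are made-up numbers for illustration, and `cosine` and `disambiguate` are hypothetical helper names rather than CuiTools functions.

```python
import math
from typing import Dict, Sequence


def cosine(u: Sequence[float], v: Sequence[float]) -> float:
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def disambiguate(
    target_vector: Sequence[float], concept_vectors: Dict[str, Sequence[float]]
) -> str:
    """Return the CUI whose vector has the highest cosine with the target word vector."""
    return max(
        concept_vectors,
        key=lambda cui: cosine(target_vector, concept_vectors[cui]),
    )


# Hypothetical vectors over a three-word vocabulary.
target = [20, 4, 7]
concepts = {
    "C0005528": [18, 5, 6],  # Biological Transport (made-up values)
    "C0150390": [1, 9, 2],   # Patient Transport (made-up values)
}
chosen = disambiguate(target, concepts)
print(chosen)
```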

31

CuiTools Approach

We explore using

First-order co-occurrence vectors

Second-order co-occurrence vectors

Our approach uses a general vector approach with SenseClusters vectors

32

Target Word Vector

Contains the words surrounding the ambiguous word

Transport of glutathione S-linked conjugates.

33

CuiTools - Concept Vectors

How can we create a vector that represents the meaning of a concept for word sense disambiguation?

34

To answer this question

We explore information in the UMLS that can be used to represent the meaning of a concept.

35

CuiTools - Concept Vectors

Adjustment

CUI definition:

Individual Adjustment: Conceptually broad term referring to a state of harmony between internal needs and external …

Adjustment Action: The act of making necessary corrections or modifications …

Psychological Adjustment: A state of harmony between internal needs and external demands and the processes used …

36

CuiTools - Concept Vectors

Blood Pressure

CUI definition:

Blood Pressure: Force exerted by the blood on the walls of the arteries and other vessels.

Blood Pressure Determination: Actions performed to measure the diastolic and systolic pressure of the blood.

Arterial Pressure: NO DEFINITION

37

CuiTools - Concept Vectors

CUI definition

Use CUI definition, but if it doesn’t exist:

PARent definition

Semantic Type definition

SYNonymous terms

For example: C0430400: Laboratory Culture

laboratory culture, microbial culture, sample culture

38

CuiTools - Concept Vectors

CUI definition

If CUI definition doesn’t exist:

PARent definition

Semantic Type definition

SYNonymous terms

SIBlings

For example: C0010453: Anthropological Culture

archeology, family, social groups

39

CuiTools - Concept Vectors

CUI definition

If CUI definition doesn’t exist:

PARent definition

Semantic Type definition

SIBlings

SYNonymous terms

TOP 50 most frequent words surrounding the terms associated with the CUI
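The last representation, the most frequent words surrounding a concept's terms, can be sketched as below. This is only an illustration under simplifying assumptions: terms are single lowercase tokens, text is split on whitespace, and the window size of 2 is arbitrary. `top_context_words` is a hypothetical name, and the sentences are invented.

```python
from collections import Counter
from typing import Iterable, List


def top_context_words(
    sentences: Iterable[str], terms: Iterable[str], k: int = 50, window: int = 2
) -> List[str]:
    """Return the k most frequent words found within `window` tokens of any term."""
    term_set = {term.lower() for term in terms}
    counts: Counter = Counter()
    for sentence in sentences:
        tokens = sentence.lower().split()
        for i, token in enumerate(tokens):
            if token in term_set:
                start = max(0, i - window)
                neighbours = tokens[start:i] + tokens[i + 1 : i + 1 + window]
                # Do not count the terms themselves as context.
                counts.update(n for n in neighbours if n not in term_set)
    return [word for word, _ in counts.most_common(k)]


# Invented sentences standing in for a training corpus.
toy_sentences = [
    "the culture medium was sterile",
    "bacterial culture medium was replaced",
]
top_words = top_context_words(toy_sentences, ["culture"], k=2)
print(top_words)
```

The resulting word list would then be turned into a concept vector in the same way as the context of the target word.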

40

Dataset

National Library of Medicine's Word Sense Disambiguation (NLM-WSD) Dataset

50 words from the 1998 MEDLINE abstracts

100 instances for each of the 50 words

The target word was manually assigned a UMLS concept or None

All instances of None were removed

Average number of concepts per ambiguous word is 2.26

41

Data subsets

Humphrey subset

Humphrey, et al 2006

45 out of the 50 words in NLM-WSD

5 words were excluded because at least two of the possible concepts associated with these words have the same semantic type

Instances that were assigned “None” were removed

42

Training Data

The training data used to create the 1st and 2nd order co-occurrence vectors is

2005 Medline baseline

43

Results


45

Results of Co-occurrence Vectors

46

Results of the Representations of Meaning

47

Results of the Representations of Meaning - CUI

Adding the parent and semantic type definitions decreased the accuracy by 6 and 7 percentage points, respectively

Parent and semantic type definitions are too broad to define the meaning of a concept

48

Results of the Representations of Meaning - SYN

Using the synonymous terms associated with a concept is too narrow to represent the meaning.

Adjustment Action

Adjustment – action

Adjustments

Adjustment, NOS

Adjustment – action qualifier value

Adjustment – action procedure

49

Results of the Representations of Meaning - SIB

Using the terms associated with the siblings of a concept is too broad to represent the meaning.

Adjustment Action

Biopsy, Cauterisation, Cautery, Cold Therapy, Desiccation, Drainage procedure, Electrolysis

50

Results of the Representations of Meaning

51

Supervised versus Unsupervised

Systems compared (figure): Joshi et al 04, McInnes et al 07, Stevenson et al 08, SenseClusters, Humphrey et al 06, CuiTools

52

To recap

How can we create a vector that represents the meaning of a concept for word sense disambiguation?

53

Conclusions

To answer this we explored information in the UMLS that could be used to represent the meaning of a concept

Finding a context to represent the meaning of a concept is difficult

We found using the top 50 most frequent words surrounding the terms associated with the concept best represented the concept for the task of word sense disambiguation

54

Take away message

Unsupervised approaches are showing promise

Their main disadvantage, a lower disambiguation accuracy than supervised approaches, is slowly disappearing

But we are not there yet … so there is more work to do

55

Future Work

UMLS-Similarity package

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

56

First Order Co-occurrence Vectors

                 Word 1   Word 2   ...   Word N
glutathione        50        6     ...      5
S-linked            5        6     ...      1
conjugates          5        0     ...     15
Target Vector      20        4     ...      7

Cell values: FREQ (glutathione, word N)

Target Vector: Average

57

First Order Co-occurrence Vectors

                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5
S-linked           .5       .6     ...     .1
conjugates         .5        0     ...     .15
Target Vector      .5       .4     ...     .25

Cell values: Similarity (glutathione, word N)

Target Vector: Average

58

First Order Co-occurrence Vectors

                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5
S-linked           .5       .6     ...     .1
conjugates         .5        0     ...     .15
Target Vector     1.5      1.2     ...     .75

Cell values: Similarity (glutathione, word N)

Target Vector: Sum (like SenseRelate)

59

First Order Co-occurrences

                 Word 1   Word 2   ...   Word N
glutathione        .5       .6     ...     .5

glutathione: C0000000, C0000001

Word N: C0005528

Similarity (glutathione, Word N) = Similarity (C0000000, C0005528) + Similarity (C0000001, C0005528) = .3 + .2 = .5
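The cell value on this slide can be sketched as a sum of concept-level similarity scores over the concepts of the row word. The sketch below reuses the placeholder CUIs and the scores .3 and .2 from the slide; `word_similarity` and `lookup` are hypothetical names, and a real similarity measure (for example from the UMLS-Similarity package) would take the place of the lookup table.

```python
from typing import Callable, Dict, Iterable, Tuple


def word_similarity(
    word_cuis: Iterable[str],
    other_cui: str,
    concept_similarity: Callable[[str, str], float],
) -> float:
    """Sum the similarity between each concept of a word and one other concept."""
    return sum(concept_similarity(cui, other_cui) for cui in word_cuis)


# Scores as shown on the slide; the CUIs C0000000 and C0000001 are placeholders.
SCORES: Dict[Tuple[str, str], float] = {
    ("C0000000", "C0005528"): 0.3,
    ("C0000001", "C0005528"): 0.2,
}


def lookup(cui_a: str, cui_b: str) -> float:
    """Return the stored score for a concept pair, or 0.0 if none is stored."""
    return SCORES.get((cui_a, cui_b), 0.0)


cell = word_similarity(["C0000000", "C0000001"], "C0005528", lookup)
print(cell)
```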

60

Future Work

UMLS-Similarity package

Creating 2nd order co-occurrence matrices based on highly similar concepts rather than words in text

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

61

Second Order Co-occurrence Vectors

           Word 1   Word 2   ...   Word N
Word 1       20       10     ...      0
Word 2       10        0     ...      0
...
Word N        2       50     ...      2

Words come from training corpus

Cell values: Frequency counts

62

Second Order Co-occurrence Vectors

           CUI 1    CUI 2    ...   CUI N
CUI 1       .20      .10     ...      0
CUI 2       .10       0      ...      0
...
CUI N       .20      .50     ...     .20

Use concepts from the UMLS

Cell values: Similarity scores

63

Future Work

UMLS-Similarity package

Creating 2nd order co-occurrence matrices based on highly similar concepts rather than co-occurrences in text

Use terms associated with CUIs that have a high similarity score with the possible concept to represent the meaning of the concept

Using the Semantic Similarity scores rather than frequency in the 1st order co-occurrence vectors

64

Similarity Scores

What is potentially gained by using the similarity or relatedness measures?

May catch words/concepts that are similar but do not frequently occur together in the training data

culture and ethnology

Ethnology is a branch of anthropology

ethnology appears with culture only five times in the training data

The concepts Anthropological Culture and Ethnology would have a high similarity score, whereas Laboratory Culture and Ethnology would not

65

Software

CuiTools version 0.19

http://cuitools.sourceforge.net

66

Thank you

Lan Aronson, François Lang, Jim Mork, Aurélie Névéol, Will Rogers

Olivier Bodenreider, Allen Browne, May Chey, Dina Demner-Fushman, Guy Divita, Kin Wah Fung, Susanne Humphrey, Dwayne McCully, Tom Rindflesch, Suresh Srinivasan
