24
University of Nebraska - Lincoln DigitalCommons@University of Nebraska - Lincoln Library Philosophy and Practice (e-journal) Libraries at University of Nebraska-Lincoln 2015 Metadata And Linked Data in Word Sense Disambiguation Mahew Corsmeier San Jose State University, [email protected] Follow this and additional works at: hp://digitalcommons.unl.edu/libphilprac Part of the Artificial Intelligence and Robotics Commons , Computational Linguistics Commons , and the Library and Information Science Commons Corsmeier, Mahew, "Metadata And Linked Data in Word Sense Disambiguation" (2015). Library Philosophy and Practice (e-journal). Paper 1227. hp://digitalcommons.unl.edu/libphilprac/1227

Metadata And Linked Data in Word Sense Disambiguation

Embed Size (px)

Citation preview

Page 1: Metadata And Linked Data in Word Sense Disambiguation

University of Nebraska - LincolnDigitalCommons@University of Nebraska - Lincoln

Library Philosophy and Practice (e-journal) Libraries at University of Nebraska-Lincoln

2015

Metadata And Linked Data in Word SenseDisambiguationMatthew CorsmeierSan Jose State University, [email protected]

Follow this and additional works at: http://digitalcommons.unl.edu/libphilprac

Part of the Artificial Intelligence and Robotics Commons, Computational Linguistics Commons,and the Library and Information Science Commons

Corsmeier, Matthew, "Metadata And Linked Data in Word Sense Disambiguation" (2015). Library Philosophy and Practice (e-journal).Paper 1227.http://digitalcommons.unl.edu/libphilprac/1227

Page 2: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 1

Introduction

Word Sense Disambiguation (WSD) is referred to as an “AI-complete” problem

(Mallery, 1998), i.e., a task that is relatively easy for people, but considerably more difficult for

machines. If someone makes a query for a polysemous word (e.g., “plant,” “bass,” “mercury,”

etc…), how is an information retrieval system to understand which sense of the word is

intended? There exist tried-and-tested methods, such as just using the most predominant sense

of the word (McCarthy, Koeling, Weeds, & Carroll, 2004); or looking at the words next to the

query term to determine the statistically most likely meaning (Jurafsky & Martin, 2009; Manning

& Schütze, 1999); but these methods often produce less-than-satisfactory results [often around

70%] (Navigli, 2009). Furthermore, these methods have been heavily dependent on the manual

creation of knowledge sources (Edmonds, 2000), which are expensive to create and subject to

change, thus creating what is termed a knowledge acquisition bottleneck (Gale, Church, &

Yarowsky, 1992). Linked Data technologies (Berners-Lee, 2006), however, allow us to utilize

existing ontologies and lexica, which can then be exploited to improve the automatic semantic

understanding of the word. This paper will examine several systems that purport to

disambiguate words by using Linked Data, and some of the models these systems use to ensure

interoperability.

Literature Review

The most complete treatment of the subject of WSD is arguably Agirre & Edmonds [ed.]

(2007), which presents a detailed definition of the problem, along with a history thereof, and

numerous algorithms which are used in practice. Kwong (2013) offers slightly more recent

coverage, along with predictions as to how WSD methods will evolve in the near future.

Page 3: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 2

Generalists might find sufficient the survey from Navigli (2009), or the chapters covering WSD

in either Jurafsky & Martin (2009) or Manning & Schütze (1999). SemEval [which was

originally named Senseval (Kilgarriff, 1998)] is an ongoing evaluation project which is used as a

baseline to assess various WSD methods, including many which will be examined in this paper.

Linked Linguistic Open Data (LLOD) is heavily dependent on metadata, and any

consideration thereof would require an examination of its standards. A brief history of the topic

of linguistic annotation can be found in Palmer & Xue (2013). Bird & Simons (2003a) and Ide,

Romary, & de la Clergerie (2004) proposed sets of best practices for linguistic annotations, while

Simons, Bird, & Spanne (2008) offered a more recent set of recommendations that specifically

suggested language codes from ISO 639-31 be used in metadata. Ide & Pustejovsky (2010)

suggested a list of best practices for language technology metadata, focusing heavily on the work

of the OLAC and European Languages Resource Association (ELRA). Gracia, Montiel-

Ponsoda, Cimiano, Gómez-Pérez. Buitelaar, & McCrae (2012) considered the issue of Linked

Data being stored in different languages, and suggested that techniques such as ontology

localization, ontology mapping, and cross-lingual ontology-based information access and

presentation would help prevent information from being locked up in linguistic data silos. Gayo,

Kontokostas, & Auer (2013) presented a set of best practices for multilingual linked open data,

and point out that SPARQL queries can be improved if tags are identified by language. Reviews

of specific linguistic annotation schemes include: the Open Languages Archives Community

[OLAC] metadata set (Bird & Simons,2003b); the General Ontology for Linguistic Description

[GOLD] (Farrar & Langendoen; 2003); ISOcat, a Data Category Registry (DCR) for the ISO TC

37 (terminology and other language and content resources) registry (Kemps-Snijders,

1 http://www-01.sil.org/iso639-3/

Page 4: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 3

Windhouwer, Wittenburg, & Wright (2009); the ISO/TC 37/SC 4 standard (Lee & Romary,

2010); the lemon (LExicon Model for ONtologies) model (McCrae, Aguado-de-Cea, Buitelaar,

Cimiano, Declerck, Gómez-Pérez, ... & Wunner, 2012); and Lexical Markup Framework [LMF]

(Francopoulo, 2013), which had a strong influence on the lemon model.

A number of papers detail projects that utilized the schemes listed above. Montiel-

Ponsoda, Gracia del Río, Aguado de Cea, & Gómez-Pérez (2011) showed how the lemon model

can be extended using a metamodel in OWL, which would allow translation to be represented on

a separate layer. Buitelaar, Cimiano, Haase, & Sintek (2009) advocated using ontologies beyond

those of RDFS, OWL, and SKOS, and presented a model called LexInfo, which combines

aspects of older models. Chiarcos, Dipper, Götze, Leser, Lüdeling, Ritz, & Stede (2008) treated

the Ontology of Linguistic Annotation, which is especially useful for corpora that have been

annotated a number of different times in a number of different methods.

Two of the most commonly used linguistic tools on the Semantic Web are the general-

purpose lexical ontologies WordNet (Fellbaum, 1998) and FrameNet (Baker, Fillmore, & Lowe,

1998). Although Ide (2014) argued that FrameNet was the “ideal resource for representation as

linked data” (18), the majority of the projects covered later in this paper utilized WordNet, and

thus this tool will be examined in more detail. Both FrameNet and WordNet are often used in

Linked Data projects to automatically annotate texts with semantic metadata. Projects that have

used these databases include Huang (2007), wherein WordNet files were converted to be

presented in OWL to assist in machine comprehension of metaphor; and BabelNet (Navigli,

2012), a resource which will be reviewed later in this paper. Ehrmann, Cecconi, Vannella,

McCrae, Cimiano, & Navigli (2014) converted BabelNet into Linked Data via the lemon model;

and Moro, Navigli, Tucci & Passonneau (2014) used BabelNet to automatically annotate the

Page 5: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 4

Manually Annotated Sub-Corpus 3.0 (MASC) and therewith were able to perform automatic

WSD with an accuracy of 70%, an impressive figure; but still too low to see much practical

adoption.

Other examples of linguistic tools used with Semantic Web technologies include

Krizhanovsky & Smirnov (2013), wherein Wiktionary was utilized to automatically create a

general-purpose lexical ontology; Hellmann, Brekle, & Auer (2013) described a similar project

wherein Wiktionary extractors made use of DBpedia to create RDF triples; de Melo (2014a)

introduced lexvo.org, a system which automatically creates URIs for each word and sense, thus

guaranteeing a constant reference; Mendes, Jakob, García-Silva, & Bizer (2011), introduced

DBpedia Spotlight, an open-source program that automatically annotates texts to the Linked

Open Data cloud by using the URIs in DBpedia; and Sérasset (2014) described the extraction of

multilingual lexical data from Wiktionary, the importation thereof into DBNary, and the final

conversion of the data into MLLOD (Multilingual Lexical Linked Open Data) via the lemon

model.

A number of very different methods of using Semantic Web technologies to disambiguate

word senses have been attempted and analyzed. Elbedweihy, Wrigley, Ciravegna, & Zhang

(2013) used a combination of WordNet, BabelNet, and Wikipedia to help generate SPARQL

queries, which would subsequently resolve ambiguities in the original queries with a success rate

of 76%; Fragos (2013) also used WordNet - in this case the extended glosses of WordNet - to

train WSD systems; McCarthy et al. (2004) used the WordNet similarity package and raw textual

corpora to solve WSD by using the predominant sense of the word, a method which achieved a

success rate of 64%; Ide (2006) treated the problem of polysemy by mapping FrameNet sets to

WordNet.

Page 6: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 5

Some case study reviews show the strengths and weaknesses of more general models:

Haase (2004) looked at tags for digital images to argue that semantic metadata can help alleviate

some of the issues of precision caused by selecting overly narrow terms; de Melo & Weikum

(2008) argued that “language-related knowledge” forms the backbone of the semantic web, and

presented ways in which linguistic items such as languages, scripts, and terms can

unambiguously be linked with URIs, and from whence new links can automatically be formed;

and Tagarelli, Longo, & Greco (2009) showed how notions of sense relatedness can be

calculated by examining overlaps between dictionary glosses and measuring distances for

ontology paths.

The rest of the paper will cover in more detail several tools and models that feature

prominently in the use of metadata to disambiguate word senses. As WordNet comes up so often

in this paper, a rudimentary comprehension of the workings of the lexicon will be necessary to

fully understand how other tools covered in this paper work. The subsequent sections will look

at lexvo.org; the lemon model; BabelNet; WordNet++; and finally tag disambiguation with

TAGora Sense Repository. This paper will conclude with a brief examination of some of the

issues of relying on meta- and linked data for Word Sense Disambiguation.

Page 7: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 6

A figure showing linked resources available in the Linguistic Linked Open Data Cloud.

Taken from http://linguistics.okfn.org/llod on November 19, 2014.

Page 8: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 7

WordNet

WordNet (Fellbaum, 1998) is a general-purpose lexical ontology that features prominently in the

web of linked data. In WordNet, words are organized according to Synset, which include not

only synonyms, but also hypernyms (broader categories), hyponyms (narrower categories),

meronyms (part-to-whole relationships), and more. Classifying words by hyper- and hyponyms

allows entries to be nested hierarchically, which can assist in determining the correct sense of a

word. WordNet is free to use and fully queriable, and below are screenshots showing the results

for a query of the word play, a highly polysemous word which will again be considered in detail

when we examine BabelNet.

Basic results of a query for the word “play” at WordNet. As one can easily see, “play” is highly polysemous, and most senses of the word are listed with their accompanying Synsets. Accessed November 19, 2014 at http://bit.ly/1uSKI0e

Drilling down on the first sense of the word “play,” we can see a list of hyponyms (narrower forms), as well as a detailed hierarchy showing the hypernyms this sense of “play” is nested in, going all the way to “entity”, which is the listing for all nouns in WordNet. Accessed November 19, 2014 at http://bit.ly/1xQxMrV

Page 9: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 8

In addition to organizing words in “Synsets,” WordNet assigns a “sense key” to each

sense of a word, a fact that will be exploited to great effect in BabelNet. For example, the sense

of “play” indicating a dramatic performance would be listed as play1n, with the superscript “1”

indicating that this is the first sense of “play” listed in WordNet, and the subscript “n” indicating

that this instance of “play” is a noun. Play1n would be in the same Synset as drama1

n and

dramatic play1n, which are also the first senses listed for their respective terms, and nouns as

well. The sense of the word “play” referring to children’s games would be listed as play8n, and

in the same Synset as child’s play2n, as they are respectively the eighth and second sense of

their terms (Navigli & Ponzetto, 2012a). Besides being used in the projects listed below,

WordNet is also used by Wikipedia, as all pages at Wikipedia have WordNet senses

automatically associated with them (Ponzetto & Navigli, 2010). However, one criticism of

WordNet is that many senses listed are too similar, which may limit the usefulness of WordNet

in WSD tasks (Ide, 2006).

Various metadata schemes have been used to link the information contained within the

WordNet ontology with other databases and ontologies. There is a nearly complete XML

version of WordNet2, and an RDF version

3 which was structured according to the lemon model

(cf. below). There exists as well a mapping of WordNet entries to schema.org terms4. The World

Wide Web Consortium also maintains extensive documentation regarding RDF/OWL

representations of WordNet (van Assem, Gangemi, & Schreiber, 2006). Finally, entries in

2 http://wordnet.princeton.edu/wordnet/download/standoff/

3 http://wordnet-rdf.princeton.edu/

4 http://schema.rdfs.org/mappings/schemaorg_wn.owl

Page 10: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 9

WordNet are semantically linked to a number of other LLOD resources, including lexvo.org (cf.

below), and VerbNet.

Lexvo.org

Lexvo.org (de Melo, 2014a) is an easy-to-use service which provides URIs to identify any term

in a given language. These URIs “serve as interchange URIs that can easily be created from any

word-segmented natural language text” (de Melo, 2014a, 3). Unambiguous reference to a

specific language is achieved by using ISO639-3 codes, and lexvo.org records also give the parts

of speech for the various senses, and provide links to translations in other languages. By

referring to what language the term is from, lexvo.org can help assist in multilingual

disambiguation, which can

The results for the English word “play.” Accessed November 21, 2014 at http://www.lexvo.org/page/term/eng/play

help prevent some accidental, embarrassing results5. In addition to linking individual senses of a

word to WordNet synsets (copies of which are hosted at lexvo.org), records at lexvo.org are

5 Problems can occur when words in foreign languages have the same spellings but different meanings.

E.g. the German word Mist is considered a faux ami (false friend), as its meaning is very different from the

orthographically identical word in English – der Mist would be probably be translated as “crap” in English.

Page 11: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 10

linked to Library of Congress Subject Headings; records at OpenCyc6; and linked internally at

lexvo.org to translations for the various senses of the word (e.g., here is the URI for the word jeu,

and here is the IRI7 for pièce de théâtre, two French words which capture two of the senses

connoted by the English play). However, it should be pointed out that when I retrieved the

record for “play,” a number of URIs linked to it had expired – a problem with linked data that I

will explore again at the conclusion of this paper. Lexvo.org uses sources such as Wikipedia,

Wiktionary, and the Unicode CLDR (Common Locale Data Repository)8 to supply descriptions

of all the languages it covers. In 2008, Lexvo.org became the first web site to publish Linked

Data based on Wiktionary on the Web (de Melo, 2014a), and it also utilizes the URIs created for

words in the WordNet lexical RDF/OWL Representation of WordNet (van Assem, Gangemi, &

Schreiber, 2006). Lexvo.org can help free data from “data silos” by providing unambiguous

URIs for individual languages, terms within, and even senses of these terms. Lexvo.org links are

used by the British Library in their British National Bibliography data; the Spanish National

Library; Sudoc – the French academic catalog; LOCAH Linked Archives Hub project, and

DBpedia Spotlight (de Melo, 2014b)

Lemon

Lemon (LExicon Model for Ontologies) “is designed to represent lexical information about

words and terms relative to an ontology on the Web” (McCrae et al., 2012, 703), and is what is

referred to as an ontology-lexicon, which shows how the classes of the ontology are realized

6 http://www.cyc.com/platform/opencyc

7 Technically, URIs cannot contain non-ASCII characters (Cyganiak, Wood, & Lanthaler, 2014), and since

pièce de théâtre contains diacritics, the URL for it would be considered an IRI.

8 http://cldr.unicode.org/

Page 12: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 11

linguistically (Buitelaar, 2010). Lemon follows a principle referred to as semantics by reference,

in that “the (lexical) meaning of the entries in the lexicon is assumed to be expressed exclusively

in the ontology and the lexicon merely points to the appropriate concepts” (McCrae et al., 2012,

703). Lemon is similar to the SKOS project (Miles and Bechhofer, 2009), but differs in that it is

“an independent and external model, intended to be published with arbitrary ontology-based

conceptualisations [sic] … in order to provide a richer description of the knowledge captured in

those resources in one or several natural languages” (McCrae et al., 2012, 703).

Lemon is an RDF-native form, which allows it to exploit existing Semantic Web technologies.

Drilling down on one of the results of a query for “play”. Accessed November 19, 2014 at http://lemon-model.net/lexica/uby/wn/WN_LexicalEntry_46245

The intention of the lemon model is not to be a semantic model, but instead it references existing

resources (e.g., ontologies) which gives it the capability to represent semantics. The model does

this through its “(lexical) sense” object, which avoids using the concept of a word sense that is

commonly found in many existing models, a practice that Kilgariff (1997) criticized in his paper

Page 13: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 12

about the issues created by word senses (McCrae et al., 2014). The lemon model is used by a

great number of linked data initiatives, including BabelNet, which will be considered later in this

paper (Ehrmann et al., 2014).

A diagram of the lemon model

from Ehrmann et al., 2014, 404

Page 14: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 13

BabelNet

Simply put, BabelNet9 (Navigli & Ponzetto, 2012a) is a combination of the structure of

Wikipedia augmented with Synsets from WordNet. The two databases were integrated by an

automatic mapping, and any lexical gaps in resource-poor languages were plugged by using

machine translation. The resulting “encyclopedic dictionary” provides lexical information on

concepts and named entities in many languages, and the end result is a semantically-linked

ontology. From version 2.0 onwards, the database was also linked with the Open Multilingual

WordNet [OMWN] (Bond & Foster, 2013), a collection of wordnets in different languages, and

OmegaWiki10

, a collaborative dictionary that is available in a number of different languages.

The 2.0 version also had more than 9 million Babel synsets (i.e. entries linked in semantic

networks) in over 50 languages (Ehrmann et al., 2014; Navigli, 2014). Currently, version 2.5 is

in beta mode.

The knowledge contained within these Babel synsets can be used to perform knowledge-

rich (Navigli & Velardi, 2005), graph-based (Bird & Liberman, 2001) Word Sense

Disambiguation in both monolingual and multilingual settings. By utilizing the partial structure

of a typical Wikipedia page, BabelNet is able to extract semantic information regarding an entry.

For example, by using Wikipedia’s redirect and disambiguation pages, internal and multilingual

links, and category designations as well, BabelNet is able to automatically glean significant

semantic information regarding an entry [cf. graph below] (Navigli & Ponzetto, 2012a; Navigli

& Ponzetto, 2012b).

9 babelnet.org

10 http://www.omegawiki.org/Meta:Main_Page

Page 15: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 14

From Navigli & Ponzatto, 2012, 220

The connections in WordNet reimagined as a graph. In this graph, the node would be the Synset, and the edges the lexical and semantic relations between terms. Remark the superscript numbers revealing the sense of a polysemous word, and the subscript letter indicating the part of speech.

The connections in Wikipedia reimagined as a graph. Here the edges are the hyperlinks between entries. For the sake of conciseness only a small portion of the graph is shown, but notice the inclusion of “tragedy” in both graphs; BabelNet was able to deduce these two entries referred to the same sense by calculating the percentage of words in common.

BabelNet uses a mapping algorithm to determine which sense of a polysemous WordNet entry

should be paired with which disambiguated Wikipedia entry. In the example illustrated above,

WordNet’s play1n would be paired with Wikipedia’s PLAY(THEATRE) entry since the graphs

for two entries share more words in common than the graphs for the other choices. Concerning

Word Sense Disambiguation, the gains in precision over rival methods may be quite small, but

the gains in recall are fairly high. Furthermore, BabelNet can also take advantage of the

multilingual links in Wikipedia to assist even in monolingual WSD, as can be seen in the figure

below (Navigli & Ponzetto, 2012a). There also exists an RDF version of BabelNet 2.0 which

contains about 1.1 billion triples, among which there are over 9 million SKOS concepts, and

nearly 100 million lemon lexical senses (Ehrmann et al., 2014). BabelNet can currently be used

Page 16: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 15

as a stand-alone resource with its Java API, a SPARQL endpoint, or as a Linked Data interface

as part of the Linguistic Linked Open Data (LLOD) cloud (Navigli, 2014). A wiki on converting

BabelNet as Linguistic Linked is also maintained by the World Wide Web Consortium11

.

From Navigli & Ponsetto, 2012, 221

BabelNet takes the linked data from Wikipedia and WordNet to create a Babel Synset, which would also include translations of the word in various languages. Having access to such translations can even be used to assist in monolingual WSD.

11

http://www.w3.org/community/bpmlod/wiki/Converting_BabelNet_as_Linguistic_Linked_Data

Page 17: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 16

Results of a query for the word “play” at BabelNet 2.5, which is currently in beta mode. Accessed November 20, 2014 at http://babelnet.org/exploreResult?word=play&lang=EN

Drilling down to the first result of the query above. Accessed November 20, 2014 at http://babelnet.org/search?word=bn:00028604n&details=1&orig=play&lang=EN

Continuation of the web page displayed on the left, showing WordNet senses and definitions, as well as redirections and definitions from Wikipedia

Page 18: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 17

WordNet++

WordNet++ (Ponzetto & Navigli, 2010) is an English-only subset of BabelNet in which entries

from Wikipedia are mapped to the corresponding senses in WordNet. For example, by looking

at the disambiguation links for the Wikipedia entry SODA (SOFT DRINK), we can construct a

disambiguation context which includes the words: soft, drink, cola, sugar. The possible matches

at WordNet include soda1n, which includes words like salt, acetate, chlorate, and benzoate.

The context for soda2n includes soft, drink, cola, bitter, etc…. Having the largest intersection

between them, WordNet++ would match Wikipedia’s SODA(SOFT DRINK) with WordNet’s

soda2n. Furthermore, WordNet++ can use the additional links in Wikipedia (e.g. SODA(SOFT

DRINK) is linked to the entry SYRUP to create even larger Synsets. This method generated

consistently high results on several Semeval tasks (Ponzetto & Navigli, 2010).

Tag Disambiguation with DBpedia

Delicious12

allows users to assign text descriptions to user-contributed tags, but these can’t

readily be used by machines (Garcia, Szomszor, Alani, & Corcho, 2009). Garcia et al.

developed an approach that uses the TAGora Sense Repository (TSR)13

, a linked data resource

that provides metadata about tags and their possible senses, to disambiguate the mostly like sense

of a word. The TSR is ultimately linked to DBpedia (Morsey, Lehmann, Auer, Stadler, &

Hellmann, 2012), a representation in the RDF model of a portion of the information in

Wikipedia. This approach faced one difficultly that many other WSD systems do not face,

12

https://delicious.com/

13 Currently at http://www.tagora-project.eu/ The URL listed in Garcia et al. 2009 did not work at the time

of writing.

Page 19: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 18

namely tags do not occur in sentences, and therefore this approach could not examine the word

in its context to disambiguate it. However, tags seldom occur singularly, and this method was

able to look at the other tags in order to try to determine the most likely sense. Garcia et al.

(2009) created an algorithm which represented the tags and the context in vectors, and then used

similarity measures to choose the most likely sense. The authors stated that this approach was

designed to be used for tag disambiguation, but could be used for other functions as well.

Conclusion

Like many other activities in the realm of artificial intelligence, Word Sense

Disambiguation is a task that is notoriously difficult for machines, but relatively simple for

humans. Successful methods for disambiguating polysemous words automatically have been

developed, but these methods are heavily reliant on marked-up corpora, which require significant

investments of time and money (Edmonds, 2000). Linked Open Data, however, is allowing

computers to exploit semantically marked-up data in an attempt to share resources, and thus we

can use already discovered knowledge to solve hitherto unseen problems. This is only possible,

of course, because of a realization of the importance of interoperability and a commitment to

shared standards.

But a dependence on interconnected data has created a new set of challenges. One huge

problem with Linked Data is the quality of links. The ever-changing nature of information

means that websites with static information are at a disadvantage, for if the information

contained within the page ever changes, it has to be changed manually to remain current. While

this problem is obviated by using Linked Data, a new problem arises if the link disappears

without notice. In my cursory explorations of several tools examined above, I came across

Page 20: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 19

several dead links, which naturally caused me to question the practicality of these tools. De

Melo (2014), in introducing lexvo.org, explained that he was reluctant to use dynamic

information from Wiktionary, as the site changes frequently. It seems entirely possible that even

a slight structural change at a site like Wiktionary could wreak havoc with the resources linked to

it, and it is not certain that the administrators at Wiktionary will take these consequences into

account when debating alterations. Furthermore, there presently seems to be no mechanism to

prune and replace dead links, and such a deficit seems to be a liability for any service relying on

linked data.

Another concern is that none of these standards or tools will gain purchase, and we could

end up with a multitude of systems competing with and failing to operate with each other. While

a model like Lemon seems to have gained wide use, BabelNet is a relatively young project, and

may drift towards obsolescence like the projects WiSeNEt (Moro & Navigli, 2012) and MENTA

(de Melo & Weikum, 2010), two recent similar projects that are scarcely mentioned today.

Nonetheless, the exponentially increasing amount of data available through Linked Open

Data seems to suggest that WSD systems will come to increasingly rely on it, and on the

metadata they will use to locate the sought-after information. Naturally, such linking would not

be possible without a commitment to shared standards, and the success of these linked data tools

can be considered another victory for the virtues of interoperability. However, in our rush to

connect everything, it would be advisable to regularly examine the quality of the data we are

linking.

Page 21: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 20

References

Agirre, E., & Edmonds, P. G. (ed.). (2007). Word Sense Disambiguation: Algorithms and

Applications. Berlin: Springer.

Baker, C. F., Fillmore, C. J., & Lowe, J. B. (1998). The Berkeley Framenet project. In

Proceedings of the 36th Annual Meeting of the Association for Computational Linguistics

and 17th International Conference on Computational Linguistics-Volume 1 (pp. 86-90).

Association for Computational Linguistics.

Bird, S., & Liberman, M. (2001). A formal framework for linguistic annotation.

arXiv:cs/9903003v1

Bird, S., & Simons, G. (2003a). Extending Dublin Core metadata to support the description and

discovery of language resources. Computers and the Humanities, 37(4), 375-388.

Bird, S., & Simons, G. (2003b). Seven dimensions of portability for language documentation and

description. Language, 79, 557-582. Accessed October 21, 2014 at

http://www.jstor.org.libaccess.sjlibrary.org/stable/4489465

Bond, F. & Foster, R. (2013). Linking and extending an Open Multilingual Wordnet. In Proc. of

51st Annual Meeting of the Association for Computational Linguistics, 1352–1362.

Buitelaar, P. (2010). Ontology-based semantic lexicons: Mapping between terms and object

descriptions. In: C.R. Huang, N. Calzolari, A. Gangemi, A. Lenci, A. Oltramari & L.

Prevot (Eds.), Ontology and the Lexicon, pp. 212–223. Cambridge: Cambridge

University Press.

Buitelaar, P., Cimiano, P., Haase, P., & Sintek, M. (2009). Towards linguistically grounded

ontologies. In The Semantic Web: Research and Applications, pp. 111-125. Berlin:

Springer

Berners-Lee, T. (2006). Linked Data. Accessed September 29, 2014 at

http://www.w3.org/DesignIssues/LinkedData.html

Chiarcos, C., Dipper, S., Götze, M., Leser, U., Lüdeling, A., Ritz, J., & Stede, M. (2008). A

flexible framework for integrating annotations from different tools and tagsets.

Traitement Automatique des Langues, 49(2), 271-293.

Cyganiak, R.; Wood, D.; Lanthaler, M. (2014) RDF 1.1 Concepts and abstract syntax. Accessed

October 7, 2014 at http://www.w3.org/TR/2014/REC-rdf11-concepts-20140225/#dfn-iri

de Melo, G. & Weikum, G. (2010). MENTA: Inducing multilingual taxonomies from Wikipedia,

in: Proceedings of the Nineteenth ACM Conference on Information and, Knowledge

Management, pp. 1099-1108. Toronto, Canada, 26–30.

de Melo, G. (2014a). Lexvo.org: Language-related information for the linguistic linked data

cloud. Semantic Web Journal. Accessed October 15, 2014 at http://www.semantic-web-

journal.net/system/files/swj521.pdf

de Melo, G. (2014b). Lexvo.org main page. Accessed November 18, 2014 at www.lexvo.org

de Melo, G., & Weikum, G. (2008, October). Language as a foundation of the semantic web. In

International Semantic Web Conference (Posters & Demos). Accessed October 17, 2014

at https://www.mpi-inf.mpg.de/~gdemelo/papers/demelo-lexvo-iswc2008.pdf

Edmonds, P. (2000). Designing a task for SENSEVAL-2 [Technical report]. Brighton, UK:

University of Brighton.

Ehrmann, M., Cecconi, F., Vannella, D., McCrae, J., Cimiano, P., & Navigli, R. (2014).

Representing multilingual data as Linked Data: the Case of BabelNet 2.0. In Proceedings.

Page 22: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 21

of Language Resource Evaluation Conference. Accessed October 20, 2014 at

http://www.lrec-conf.org/proceedings/lrec2014/pdf/810_Paper.pdf

Elbedweihy, K., Wrigley, S. N., Ciravegna, F., & Zhang, Z. (2013). Using Babelnet in bridging

the gap between natural language queries and linked data concepts. In NLP-DBPEDIA@

ISWC. Accessed October 17, 2014 at http://ceur-ws.org/Vol-

1064/Elbedweihy_Using_BabelNet.pdf

Farrar, S., & Langendoen, D. T. (2003). A linguistic ontology for the semantic web. Glot

International, 7(3), 97-100.

Fellbaum, C. (ed.). (1998). WordNet: An Electronic Lexical Database. Cambridge, MA: MIT

Press.

Fragos, K. (2013). Modeling WordNet glosses to perform word sense disambiguation.

International Journal on Artificial Intelligence Tools, 22(02). Accessed October 24,

2014 at

http://www.worldscientific.com.libaccess.sjlibrary.org/doi/abs/10.1142/S0218213013500

036

Francopoulo, G. (ed.). (2013) LMF Lexical Markup Framework. Hoboken, NJ: Wiley.

Gale, W. A., Church, K., & Yarowsky, D. (1992). Estimating upper and lower bounds on the

performance of word-sense disambiguation programs. In Proceedings of the 30th Annual

Meeting of the Association for Computational Linguistics, pp. 249–256.

Garcia, A., Szomszor, M., Alani, H., & Corcho, O. (2009). Preliminary results in tag

disambiguation using DBpedia. Accessed October 20, 2014 at

http://oro.open.ac.uk/20006/1/ckcar09-final.pdf

Gayo, J. E. L., Kontokostas, D., & Auer, S. (2013). Multilingual linked open data patterns.

Semantic Web Journal Accessed October 15, 2014 at http://www.semantic-web-

journal.net/system/files/swj495.pdf

Gracia, J., Montiel-Ponsoda, E., Cimiano, P., Gómez-Pérez, A., Buitelaar, P., & McCrae, J.

(2012). Challenges for the multilingual web of data. Web Semantics: Science, Services

and Agents on the World Wide Web, 11, 63-71.

Haase, K. (2004) Context for semantic metadata. Proceedings of the 12th Annual ACM

International Conference on Multimedia. ACM, 2004. Accessed October 16, 2014 at

http://www.yamod.ch/media/11026/haase01.pdf

Hellmann, S., Brekle, J., & Auer, S. (2013). Leveraging the crowdsourcing of lexical resources

for bootstrapping a linguistic data cloud. In Semantic Technology, pp. 191-206. Berlin:

Springer.

Huang, X. X. (2007). An OWL-based WordNet lexical ontology. Journal of Zhejiang University

SCIENCE A, 8(6), 864-870.

Ide, N. (2006). Making senses: Bootstrapping sense-tagged lists of semantically-related words.

In Computational Linguistics and Intelligent Text Processing, pp. 13-27. Berlin: Springer.

Ide, N. (2014). FrameNet and Linked Data. In Proceedings of Frame Semantics in NLP: A

Workshop in Honor of Chuck Fillmore (Vol. 1929, pp. 18-21).

Ide, N., & Pustejovsky, J. (2010, January). What does interoperability mean, anyway? Toward an

operational definition of interoperability for language technology. In Proceedings of the

Second International Conference on Global Interoperability for Language Resources.

Hong Kong, China. Accessed October 21, 2014 at

http://www.cs.vassar.edu/~ide/papers/ICGL10.pdf

Page 23: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 22

Ide, N., Romary, L., & de la Clergerie, E. (2004). International standard for a linguistic

annotation framework. Natural Language Engineering, 10(3-4), 211-225.

Jurafsky, D., & Martin, J. H. (2009). Speech and Language Processing: An Introduction to

Natural Language Processing, Computational Linguistics, and Speech Recognition.

Upper Saddle River, NJ: Prentice Hall.

Kemps-Snijders, M., Windhouwer, M., Wittenburg, P., & Wright, S. E. (2009). ISOcat:

Remodelling metadata for language resources. International Journal of Metadata,

Semantics and Ontologies, 4(4), 261-276.

Kilgarriff, A. (1997). I don’t believe in word senses. Computers and the Humanities, 31(2), 91–

113.

Kilgarriff, A. (1998, May). Senseval: An exercise in evaluating word sense disambiguation

programs. In Proceedings of the First International Conference on Language Resources

and Evaluation, pp. 581-588. Accessed October 24, 2014 at

http://citeseerx.ist.psu.edu/viewdoc/download;jsessionid=3CDF14935273C9C8D3D5BF

AD0964F349?doi=10.1.1.32.1931&rep=rep1&type=pdf

Krizhanovsky, A. A., & Smirnov, A. V. (2013). An approach to automated construction of a

general-purpose lexical ontology based on Wiktionary. Journal of Computer and Systems

Sciences International, 52(2), 215-225.

Kwong, O. Y. (2013). New Perspectives on Computational and Cognitive Strategies for Word

Sense Disambiguation. Berlin: Springer.

Lee, K., & Romary, L. (2010). Towards interoperability of ISO standards for Language Resource

Management. Proc. ICGL 2010.

Mallery, J. C. (1988). Thinking about foreign policy: Finding an appropriate role for artificial

intelligence computers. Ph.D. dissertation. Cambridge, MA: MIT Political Science

Department.

Manning, C.D., & Schütze, H. (1999). Foundations of Statistical Natural Language Processing.

Cambridge, MA: MIT Press.

McCarthy, D., Koeling, R., Weeds, J., & Carroll, J. (2004, July). Finding predominant word

senses in untagged text. In Proceedings of the 42nd Annual Meeting on Association for

Computational Linguistics (p. 279). Association for Computational Linguistics.

Accessed October 24, 2014 at http://sro.sussex.ac.uk/1210/1/senseranks.pdf

McCrae, J., Aguado-de-Cea, G., Buitelaar, P., Cimiano, P., Declerck, T., Gómez-Pérez, A., ... &

Wunner, T. (2012). Interchanging lexical resources on the Semantic Web. Language

Resources and Evaluation, 46(4), 701-719.

Mendes, P. N., Jakob, M., García-Silva, A., & Bizer, C. (2011, September). DBpedia spotlight:

Shedding light on the web of documents. In Proceedings of the 7th International

Conference on Semantic Systems, pp. 1-8. ACM. Accessed October 21, 2014 at

http://oa.upm.es/11477/2/INVE_MEM_2011_105377.pdf

Miles, A., & Bechhofer, S. (2009). SKOS simple knowledge organization system reference.

http://www.w3.org/TR/skos-reference/. Accessed November 19, 2014.

Montiel-Ponsoda, E., Gracia del Río, J., Aguado de Cea, G., & Gómez-Pérez, A. (2011).

Representing translations on the semantic web. Accessed October 17, 2014 at

http://oa.upm.es/10295/1/Representing.pdf

Moro, A. & Navigli, R. (2012). WiSeNet: Building a Wikipedia-based semantic network with

ontologized relations, in: Proceedings of the 21st ACM Conference on Information and

Knowledge Management. Maui, Hawaii.

Page 24: Metadata And Linked Data in Word Sense Disambiguation

METADATA AND LINKED DATA IN WORD SENSE DISAMBIGUATION 23

Moro, A., Navigli, R., Tucci, F. M., & Passonneau, R. J. (2014). Annotating the MASC Corpus

with BabelNet. In Proceedings of LREC. Accessed October 16, 2014 at

http://wwwusers.di.uniroma1.it/~moro/MoroEtAL_LREC2014.pdf

Morsey, M.; Lehmann, J.; Auer, S.; Stadler, C.; & Hellmann, S. (2012). DBpedia and the live

extraction of structured data from Wikipedia. Program, 46(2), 157-181. DOI:

10.1108/00330331211221828

Navigli, R. & Ponzetto, S. P. (2012a). BabelNet: The automatic construction, evaluation and

application of a wide-coverage multilingual semantic network. Artificial Intelligence, 193,

217-250.

Navigli, R. & Ponzetto, S. P. (2012b). Joining forces pays off: multilingual joint word sense

disambiguation. In Proceedings of EMNLP, pp. 1399–1410. Jeju, Korea.

Navigli, R., & Velardi, P. (2005). Structural semantic interconnections: A knowledge-based

approach to word sense disambiguation. Pattern Analysis and Machine Intelligence,

IEEE Transactions on, 27(7), 1075-1086.

Navigli, R. (2009). Word sense disambiguation: A survey. ACM Computing Surveys (CSUR),

41(2), 10.

Navigli, R. (2012). BabelNet goes to the (Multilingual) Semantic Web. In: ISWC 2012 Workshop

on Multilingual Semantic Web. Accessed October 21, 2014 at http://ceur-ws.org/Vol-

936/paper1.pdf

Navigli, R. (2014). About Babelnet. Accessed November 21, 2014 at http://babelnet.org/about

Palmer, M. & Xue, N. (2013). Linguistic annotation. In Clark, A., Fox, C., & Lappin, S. (eds.).

The Handbook of Computational Linguistics and Natural Language Processing, West

Sussex, UK: John Wiley & Sons.

Ponzetto, S. P. & Navigli, R. (2010). Knowledge-rich word sense disambiguation rivaling

supervised systems. In: Proceedings of the 48th Annual Meeting of the Association for

Computational Linguistics, Association for Computational Linguistics, pp. 1522-1531.

Stroudsburg, PA: Association for Computational Linguistics.

Sérasset, G. (2014). DBnary: Wiktionary as a Lemon-based multilingual lexical resource in RDF.

Semantic Web Journal-Special issue on Multilingual Linked Open Data. Accessed

October 24, 2014 at http://www.semantic-web-journal.net/system/files/swj648.pdf

Simons, G., Bird, S., & Spanne, J. (2008). Best practice recommendations for language resource

description. Accessed October 21, 2014 at http://www.language-

archives.org/REC/bpr.html

Tagarelli, A., Longo, M., & Greco, S. (2009). Word sense disambiguation for XML structure

feature generation. In The Semantic Web: Research and Applications (pp. 143-157).

Berlin: Springer.

van Assem, M., Gangemi, A. & Schreiber, G. (2006). RDF/OWL Representation of WordNet.

W3C Working Draft, World Wide Web Consortium. http://www.w3.org/TR/wordnet-rdf/