Indexing With WordNet Synsets Can Improve Text Retrieval



Julio Gonzalo, Felisa Verdejo, Irina Chugur and Juan Cigarrán
UNED
Ciudad Universitaria, s.n.
28040 Madrid - Spain

{julio,felisa,irina,juanci}@ieec.uned.es

Abstract

The classical vector space model for text retrieval is shown to give better results (up to 29% better in our experiments) if WordNet synsets are chosen as the indexing space, instead of word forms. This result is obtained for a manually disambiguated test collection (of queries and documents) derived from the SEMCOR semantic concordance. The sensitivity of retrieval performance to (automatic) disambiguation errors when indexing documents is also measured. Finally, it is observed that if queries are not disambiguated, indexing by synsets performs (at best) only as well as standard word indexing.

1 Introduction

Text retrieval deals with the problem of finding all the relevant documents in a text collection for a given user's query. A large-scale semantic database such as WordNet (Miller, 1990) seems to have great potential for this task. There are, at least, two obvious reasons:

• It offers the possibility of discriminating word senses in documents and queries. This would prevent matching spring in its "metal device" sense with documents mentioning spring in the sense of springtime, and retrieval accuracy could thus be improved.

• WordNet provides the chance of matching semantically related words. For instance, spring, fountain, outflow and outpouring, in the appropriate senses, can be identified as occurrences of the same concept, 'natural flow of ground water'. And beyond synonymy, WordNet can be used to measure semantic distance between occurring terms to get more sophisticated ways of comparing documents and queries.

However, the general feeling within the information retrieval community is that dealing explicitly with semantic information does not significantly improve the performance of text retrieval systems. This impression is founded on the results of some experiments measuring the role of Word Sense Disambiguation (WSD) for text retrieval, on one hand,

and on some attempts to exploit the features of WordNet and other lexical databases, on the other hand.

In (Sanderson, 1994), word sense ambiguity is shown to produce only minor effects on retrieval accuracy, apparently confirming that query/document matching strategies already perform an implicit disambiguation. Sanderson also estimates that if explicit WSD is performed with less than 90% accuracy, the results are worse than not disambiguating at all. In his experimental setup, ambiguity is introduced artificially in the documents, substituting randomly chosen pairs of words (for instance, banana and kalashnikov) with artificially ambiguous terms (banana/kalashnikov). While his results are very interesting, it remains unclear, in our opinion, whether they would be corroborated with real occurrences of ambiguous words. There is also another minor weakness in Sanderson's experiments: when he "disambiguates" a term such as spring/bank to get, for instance, bank, he has done only a partial disambiguation, as bank can be used in more than one sense in the text collection.
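Sanderson's pseudo-word setup can be illustrated with a short sketch. This is a minimal, hypothetical reconstruction of the idea (the tokenisation, word pairs and function names are ours, not taken from his paper): pairs of unrelated words are merged into artificially ambiguous terms, and the original word is kept aside as the gold sense.

```python
def pseudo_ambiguate(tokens, word_pairs):
    """Merge each pair of unrelated words into one artificially ambiguous
    pseudo-word (e.g. 'banana' and 'kalashnikov' -> 'banana/kalashnikov').
    The original word is kept as the gold sense for later evaluation."""
    pseudo = {}
    for a, b in word_pairs:
        pseudo[a] = pseudo[b] = f"{a}/{b}"
    new_tokens, gold_senses = [], []
    for t in tokens:
        new_tokens.append(pseudo.get(t, t))
        gold_senses.append(t if t in pseudo else None)
    return new_tokens, gold_senses

doc = "the banana was ripe and the kalashnikov was loaded".split()
tokens, gold = pseudo_ambiguate(doc, [("banana", "kalashnikov")])
print(tokens)  # both words now appear as the ambiguous term 'banana/kalashnikov'
```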

Besides disambiguation, many attempts have been made to exploit WordNet for text retrieval purposes. Mainly two aspects have been addressed: the enrichment of queries with semantically related terms, on one hand, and the comparison of queries and documents via conceptual distance measures, on the other.

Query expansion with WordNet has been shown to be potentially relevant to enhance recall, as it permits matching relevant documents that do not contain any of the query terms (Smeaton et al., 1995). However, it has produced few successful experiments. For instance, (Voorhees, 1994) manually expanded 50 queries over a TREC-1 collection (Harman, 1993) using synonymy and other semantic relations from WordNet 1.3. Voorhees found that the expansion was useful with short, incomplete queries, and rather useless for complete topic statements, where other expansion techniques worked better. For short queries, there remained the problem of selecting the expansions automatically; doing it badly could degrade retrieval performance rather than enhance it.



In (Richardson and Smeaton, 1995), a combination of rather sophisticated techniques based on WordNet, including automatic disambiguation and measures of semantic relatedness between query/document concepts, resulted in a drop of effectiveness. Unfortunately, the effects of WSD errors could not be discerned from the accuracy of the retrieval strategy. However, in (Smeaton and Quigley, 1996), retrieval on a small collection of image captions (that is, on very short documents) is reasonably improved using measures of conceptual distance between words based on WordNet 1.4. Previously, captions and queries had been manually disambiguated against WordNet. The reason for such success is that with very short documents (e.g. boys playing in the sand) the chances of finding the original terms of the query (e.g. children running on a beach) are much lower than for average-size documents (which typically include many phrasings for the same concepts). These results are in agreement with (Voorhees, 1994), but the question remains whether conceptual distance matching would scale up to longer documents and queries.

In addition, the experiments in (Smeaton and Quigley, 1996) only consider nouns, while WordNet offers the chance to use all open-class words (nouns, verbs, adjectives and adverbs).

Our essential retrieval strategy in the experiments reported here is to adapt a classical vector-model based system, using WordNet synsets as the indexing space instead of word forms. This approach combines two benefits for retrieval: one, that terms are fully disambiguated (this should improve precision); and two, that equivalent terms can be identified (this should improve recall). Note that query expansion does not satisfy the first condition, as the terms used to expand are words and, therefore, are in turn ambiguous. On the other hand, plain word sense disambiguation does not satisfy the second condition, as equivalent senses of two different words are not matched. Thus, indexing by synsets gets maximum matching and minimum spurious matching, which makes it a good starting point to study text retrieval with WordNet (a minimal sketch of this indexing scheme is given at the end of this section).

Given this approach, our goal is to test two main issues which are not clearly answered, to our knowledge, by the experiments mentioned above:

• Abstracting from the problem of sense disambiguation, what potential does WordNet offer for text retrieval? In particular, we would like to extend experiments with manually disambiguated queries and documents to average-size texts.

• Once the potential of WordNet is known for a manually disambiguated collection, we want to test the sensitivity of retrieval performance to disambiguation errors introduced by automatic WSD.

This paper reports on our first results answering these questions. The next section describes the test collection that we have produced. The experiments are described in Section 3, and the last section discusses the results obtained.
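To make the indexing idea concrete, here is a minimal sketch of synset-based indexing in a vector space model. The sense inventory below is an invented toy mapping (it stands in for manual disambiguation against WordNet), and the weighting is plain term frequency with cosine similarity, far simpler than the SMART schemes actually used; it only illustrates why synonymous senses of different words end up matching.

```python
from collections import Counter
from math import sqrt

# Toy sense inventory: (word, sense label) -> synset identifier. In the paper
# this mapping comes from manual disambiguation against WordNet 1.5; here it
# is invented for illustration only.
SYNSET_OF = {
    ("spring", "season"): "n_springtime",
    ("springtime", "-"):  "n_springtime",
    ("spring", "water"):  "n_natural_flow",
    ("fountain", "-"):    "n_natural_flow",
}

def synset_vector(sense_tagged_tokens):
    """Build a synset-frequency vector from (word, sense) pairs.
    Terms not found in the inventory are kept as literal word forms,
    as is done for unknown terms in IR-SEMCOR."""
    return Counter(SYNSET_OF.get(tok, tok[0]) for tok in sense_tagged_tokens)

def cosine(u, v):
    dot = sum(u[k] * v[k] for k in u)
    norm = sqrt(sum(x * x for x in u.values())) * sqrt(sum(x * x for x in v.values()))
    return dot / norm if norm else 0.0

doc   = [("spring", "water"), ("flows", "-"), ("underground", "-")]
query = [("fountain", "-")]
# 'fountain' and 'spring' (water sense) share a synset, so they match:
print(cosine(synset_vector(query), synset_vector(doc)))
```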

2 The test collection

The best-known publicly available corpus hand-tagged with WordNet senses is SEMCOR (Miller et al., 1993), a subset of the Brown Corpus of about 100 documents that occupies about 11 Mb (including tags). The collection is rather heterogeneous, covering politics, sports, music, cinema, philosophy, excerpts from fiction novels, scientific texts, etc. A new, bigger version has been made available recently (Landes et al., 1998), but we have not yet adapted it for our collection.

We have adapted SEMCOR in order to build a test collection, which we call IR-SEMCOR, in four manual steps:

1. We have split the documents to get coherent chunks of text for retrieval. We have obtained 171 fragments that constitute our text collection, with an average length of 1331 words per fragment.

2. We have extended the original TOPIC tags of the Brown Corpus with a hierarchy of subtags, assigning a set of tags to each text in our collection. This is not used in the experiments reported here.

3. We have written a summary for each of the fragments, with lengths varying between 4 and 50 words and an average of 22 words per summary. Each summary is a human explanation of the text contents, not a mere bag of related keywords. These summaries serve as queries on the text collection, and thus there is exactly one relevant document per query.

4. Finally, we have hand-tagged each of the summaries with WordNet 1.5 senses. When a word or term was not present in the database, it was left unchanged. In general, such terms correspond to groups (e.g. Fulton_County_Grand_Jury), persons (Cervantes) or locations (Fulton).

We also generated a list of "stop-senses" and a list of "stop-synsets", automatically translating a standard list of stop words for English.

Such a test collection offers the chance to measure the adequacy of WordNet-based approaches to IR independently of the disambiguator being used, but it also offers the chance to measure the role of automatic disambiguation by introducing different rates of "disambiguation errors" in the collection.
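The stop-sense and stop-synset lists mentioned above can be produced along the following lines. This is a sketch of the general idea only, assuming an NLTK installation with the WordNet data downloaded (which ships a later WordNet version than the 1.5 used here); the stop-word list shown is an illustrative fragment, and the exact translation procedure used for IR-SEMCOR may differ.

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

STOP_WORDS = ["be", "have", "do", "not", "very"]   # illustrative fragment only

# A "stop-synset" is any synset a stop word can belong to;
# a "stop-sense" is any sense (lemma key) of a stop word itself.
stop_synsets = {s.name() for w in STOP_WORDS for s in wn.synsets(w)}
stop_senses = {l.key() for w in STOP_WORDS
               for s in wn.synsets(w)
               for l in s.lemmas() if l.name() == w}

print(len(stop_synsets), len(stop_senses))
```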



Experiment                                                % correct document retrieved in first place
Indexing by synsets                                       62.0
Indexing by word senses                                   53.2
Indexing by words (basic SMART)                           48.0
Indexing by synsets with a 5% error ratio                 62.0
Id. with 10% error ratio                                  60.8
Id. with 20% error ratio                                  56.1
Id. with 30% error ratio                                  54.4
Indexing with all possible synsets (no disambiguation)    52.6
Id. with 60% error ratio                                  49.1
Synset indexing with non-disambiguated queries            48.5
Word-sense indexing with non-disambiguated queries        40.9

Table 1: Percentage of correct documents retrieved in first place

The only disadvantage is the small size of the collection, which does not allow fine-grained distinctions in the results. However, it has proved large enough to give meaningful statistics for the experiments reported here.

Although designed for our concrete text retrieval testing purposes, the resulting database could also be useful for many other tasks. For instance, it could be used to evaluate automatic summarization systems (measuring the semantic relation between the manually written and hand-tagged summaries of IR-SEMCOR and the output of text summarization systems) and other related tasks.

3 The experiments

We have performed a number of experiments using a standard vector-model based text retrieval system, SMART (Salton, 1971), and three different indexing spaces: the original terms in the documents (for standard SMART runs), the word senses corresponding to the document terms (in other words, a manually disambiguated version of the documents), and the WordNet synsets corresponding to the document terms (roughly equivalent to concepts occurring in the documents).
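Because each query has exactly one relevant document, the figure reported in Table 1 is simply the proportion of queries whose relevant document is ranked first. A minimal sketch of that measure follows; the data structures are hypothetical and do not correspond to SMART's output format.

```python
def first_place_accuracy(rankings, relevant):
    """rankings: {query_id: [doc_id, ...]} ordered by decreasing similarity.
    relevant: {query_id: doc_id}, exactly one relevant document per query.
    Returns the percentage of queries whose relevant document is ranked first."""
    hits = sum(1 for q, ranked in rankings.items()
               if ranked and ranked[0] == relevant[q])
    return 100.0 * hits / len(rankings)

# Hypothetical example with three queries:
rankings = {"q1": ["d7", "d2"], "q2": ["d1", "d9"], "q3": ["d3", "d5"]}
relevant = {"q1": "d7", "q2": "d9", "q3": "d3"}
print(first_place_accuracy(rankings, relevant))   # ~66.7 (2 of 3 ranked first)
```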

These are all the experiments considered here:

1. The original texts as documents and the summaries as queries. This is a classic SMART run, with the peculiarity that there is only one relevant document per query.

2. Both documents (texts) and queries (summaries) are indexed in terms of word senses. That means that we manually disambiguate all terms. For instance, "debate" might be substituted with "debate%1:10:01::". The three numbers denote the part of speech, the WordNet lexicographer's file and the sense number within the file. In this case, it is a noun belonging to the noun.communication file. With this collection we can see whether plain disambiguation is helpful for retrieval, because word senses are distinguished but synonymous word senses are not identified.

3. In the previous collection, we substitute each word sense with a unique identifier of its associated synset. For instance, "debate%1:10:01::" is substituted with "n04616654", which is an identifier for "{argument, debate}" (a discussion in which reasons are advanced for and against some proposition or proposal; "the argument over foreign aid goes on and on"). This collection represents conceptual indexing, as equivalent word senses are represented with a unique identifier.

4. We produced different versions of the synset-indexed collection, introducing fixed percentages of erroneous synsets. Thus we simulated a word sense disambiguation process with 5%, 10%, 20%, 30% and 60% error rates. The errors were introduced randomly in the ambiguous words of each document. With this set of experiments we can measure the sensitivity of the retrieval process to disambiguation errors (see the sketch at the end of this section).

5. To complement the previous experiment, we also prepared collections indexed with all possible meanings (in their word sense and synset versions) for each term. This represents a lower bound for automatic disambiguation: we should not disambiguate if performance is worse than considering all possible senses for every word form.

6. We also produced a non-disambiguated version of the queries (again, both in their word sense and synset variants). This set of queries was run against the manually disambiguated collection.



Figure 1: Different indexing approaches. (Precision/recall curves for (1) indexing by synsets, (2) indexing by word senses, and (3) indexing by words with basic SMART; recall ranges from 0.3 to 0.9.)

In all cases, we compared the atc and nnn standard weighting schemes, and they produced very similar results. Thus we only report here the results for the nnn weighting scheme.
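Experiment 4 can be emulated with a short sketch like the following. It is an approximation under our own assumptions (a toy sense inventory and a uniform random choice among the wrong candidates); the paper only states that errors were introduced randomly in the ambiguous words of each document.

```python
import random

def corrupt_synsets(doc_synsets, candidates, error_rate, seed=0):
    """Simulate an imperfect WSD module: for a fixed fraction of the
    ambiguous tokens, replace the correct synset with one of the other
    candidate synsets of that word, chosen at random.
    doc_synsets: list of (word, correct_synset_id)
    candidates:  dict word -> list of possible synset ids for that word."""
    rng = random.Random(seed)
    ambiguous = [i for i, (w, _) in enumerate(doc_synsets)
                 if len(candidates.get(w, [])) > 1]
    n_errors = round(error_rate * len(ambiguous))
    corrupted = list(doc_synsets)
    for i in rng.sample(ambiguous, n_errors):
        word, correct = doc_synsets[i]
        wrong = [s for s in candidates[word] if s != correct]
        corrupted[i] = (word, rng.choice(wrong))
    return corrupted

# Hypothetical toy "document" corrupted at a 50% error rate over ambiguous words:
doc = [("spring", "n_springtime"), ("bank", "n_river_bank"), ("debate", "n_argument")]
cands = {"spring": ["n_springtime", "n_natural_flow"],
         "bank": ["n_river_bank", "n_financial_inst"],
         "debate": ["n_argument"]}
print(corrupt_synsets(doc, cands, error_rate=0.5))
```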

4 Discussion of results

4.1 Indexing approach

In Figure 1 we compare different indexing approaches: indexing by synsets, indexing by words (basic SMART) and indexing by word senses (experiments 1, 2 and 3). The leftmost point in each curve represents the percentage of documents that were successfully ranked as the most relevant for their summary/query. The next point represents the documents retrieved as the first or the second most relevant to their summary/query, and so on. Note that, as there is only one relevant document per query, the leftmost point is the most representative of each curve. Therefore, we have included these results separately in Table 1.

The results are encouraging:

• Indexing by WordNet synsets produces a remarkable improvement on our test collection: 62% of the documents are retrieved in first place by their summaries, against 48% for the basic SMART run. This represents 14% more documents, a 29% improvement with respect to SMART. This is an excellent result, although we should keep in mind that it is obtained with manually disambiguated queries and documents. Nevertheless, it shows that WordNet can greatly enhance text retrieval: the problem resides in achieving accurate automatic Word Sense Disambiguation.

• Indexing by word senses improves performance when considering up to four documents retrieved for each query/summary, although it is worse than indexing by synsets. This confirms our intuition that synset indexing has advantages over plain word sense disambiguation, because it permits matching semantically similar terms. Taking only the first document retrieved for each summary, the disambiguated collection gives a 53.2% success rate against 48% for the plain SMART run, which represents an 11% improvement. For recall levels higher than 0.85, however, the disambiguated collection performs slightly worse. This may seem surprising, as word sense disambiguation should only increase our knowledge about queries and documents. But we should bear in mind that WordNet 1.5 is not the perfect database for text retrieval, and indexing by word senses prevents some matchings that can be useful for retrieval.



Figure 2: Sensitivity to disambiguation errors. (Precision/recall curves for (1) manual disambiguation, (2) 5% error, (3) 10% error, (4) 20% error, (5) 30% error, (6) all possible synsets per word, i.e. without disambiguation, and (7) 60% error.)


Figure 3: Performance with non-disambiguated queries. (Precision/recall curves for indexing by words (SMART), synset indexing with non-disambiguated queries, and word-sense indexing with non-disambiguated queries.)

ambiguous words of English) are reported in (Ng, 1997). They reach a 58.7% accuracy on a Brown Corpus subset and 75.2% on a subset of the Wall Street Journal Corpus. A more careful evaluation of the role of WSD is needed to know whether this is good enough for our purposes.

In any case, we have only emulated a WSD algorithm that just picks one sense and discards the rest. A more reasonable approach here could be to give different probabilities to each sense of a word, and to use them to weight synsets in the vectorial representation of documents and queries.
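The probabilistic alternative suggested above could look like the following sketch, where each token carries a distribution over its candidate synsets instead of a single choice; the distributions and identifiers here are hypothetical.

```python
from collections import defaultdict

def weighted_synset_vector(tokens_with_sense_probs):
    """Each token comes with a probability distribution over its candidate
    synsets (e.g. from a probabilistic WSD module); instead of committing
    to one sense, every candidate contributes to the vector with its probability.
    tokens_with_sense_probs: list of dicts {synset_id: probability}."""
    vec = defaultdict(float)
    for dist in tokens_with_sense_probs:
        for synset_id, p in dist.items():
            vec[synset_id] += p
    return dict(vec)

# Hypothetical: 'spring' is 0.7 springtime / 0.3 water source, 'flows' is unambiguous.
doc = [{"n_springtime": 0.7, "n_natural_flow": 0.3}, {"v_flow": 1.0}]
print(weighted_synset_vector(doc))
```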

4.3 Performance for non-disambiguated queries

In Figure 3 we have plotted the results of runs with a non-disambiguated version of the queries, both for word sense indexing and synset indexing, against the manually disambiguated collection (experiment 6). The synset run performs approximately as well as the basic SMART run. It therefore seems useless to apply conceptual indexing if no disambiguation of the query is feasible. This is not a major problem in an interactive system that may help the user to disambiguate the query, but it must be taken into account if the process is not interactive and the query is too short for reliable disambiguation.
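For comparison, the non-disambiguated variants (experiments 5 and 6) simply keep every candidate synset of each word. A sketch, assuming NLTK with the WordNet data downloaded (note that NLTK ships a later WordNet version than the 1.5 used in the paper):

```python
from nltk.corpus import wordnet as wn   # requires: nltk.download('wordnet')

def all_synsets_query(query_words):
    """Map each query word to the identifiers of every WordNet synset it can
    belong to, i.e. no disambiguation at all."""
    return {w: [s.name() for s in wn.synsets(w)] for w in query_words}

print(all_synsets_query(["spring", "debate"]))
```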

5 Conclusions

We have experimented with a retrieval approach based on indexing in terms of WordNet synsets instead of word forms, trying to address two questions: 1) what potential does WordNet offer for text retrieval, abstracting from the problem of sense disambiguation, and 2) what is the sensitivity of retrieval performance to disambiguation errors. The answer to the first question is that indexing by synsets can be very helpful for text retrieval; our experiments give up to a 29% improvement over a standard SMART run indexing with words. We believe that these results have to be further contrasted, but they strongly suggest that WordNet can be more useful to Text Retrieval than was previously thought.

The second question needs further, more fine-grained experiments to be clearly answered. However, for our test collection, we find that error rates below 30% still produce better results than standard word indexing, and that from 30% to 60% error rates the approach does not behave worse than the standard SMART run. We also find that the queries have to be disambiguated to take advantage of the approach; otherwise, the best possible results with synset indexing do not improve on the performance of standard word indexing.

Our first goal now is to improve our retrieval system in many ways: studying how to enrich the query with semantically related synsets, how to



compare documents and queries using semantic information beyond the cosine measure, and how to obtain weights for synsets according to their position in the WordNet hierarchy, among other issues.

A second goal is to apply synset indexing in a Cross-Language environment, using the EuroWordNet multilingual database (Gonzalo et al., In press). Indexing by synsets offers a neat way of performing language-independent retrieval, by mapping synsets into the EuroWordNet InterLingual Index that links the monolingual wordnets for all the languages covered by EuroWordNet.

Acknowledgments

This research is being supported by the European Community, project LE #4003, and also partially by the Spanish government, project TIC-96-1243-C03-01. We are indebted to Renée Pohlmann for giving us good pointers at an early stage of this work, and to Anselmo Peñas and David Fernández for their help finishing up the test collection.

References

J. Gonzalo, M. F. Verdejo, C. Peters, and N. Calzolari. In press. Applying EuroWordNet to multilingual text retrieval. Journal of Computers and the Humanities, Special Issue on EuroWordNet.

D. K. Harman. 1993. The first text retrieval conference (TREC-1). Information Processing and Management, 29(4):411-414.

S. Landes, C. Leacock, and R. Tengi. 1998. Building semantic concordances. In WordNet: An Electronic Lexical Database. MIT Press.

G. A. Miller, C. Leacock, R. Tengi, and R. T. Bunker. 1993. A semantic concordance. In Proceedings of the ARPA Workshop on Human Language Technology. Morgan Kaufmann.

G. Miller. 1990. Special issue, WordNet: An on-line lexical database. International Journal of Lexicography, 3(4).

H. T. Ng. 1997. Exemplar-based word sense disambiguation: Some recent improvements. In Proceedings of the Second Conference on Empirical Methods in NLP.

R. Richardson and A. F. Smeaton. 1995. Using WordNet in a knowledge-based approach to information retrieval. In Proceedings of the BCS-IRSG Colloquium, Crewe.

G. Salton, editor. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall.

M. Sanderson. 1994. Word sense disambiguation and information retrieval. In Proceedings of the 17th International Conference on Research and Development in Information Retrieval.

A. F. Smeaton and A. Quigley. 1996. Experiments on using semantic distances between words in image caption retrieval. In Proceedings of the 19th International Conference on Research and Development in IR.

A. Smeaton, F. Kelledy, and R. O'Donnell. 1995. TREC-4 experiments at Dublin City University: Thresholding posting lists, query expansion with WordNet and POS tagging of Spanish. In Proceedings of TREC-4.

Ellen M. Voorhees. 1994. Query expansion using lexical-semantic relations. In Proceedings of the 17th Annual International ACM-SIGIR Conference on Research and Development in Information Retrieval.