Measuring Semantic Similarity between Words Using Web Search Engines


    WWW 2007 / Track: Semantic Web Session: Similarity and Extraction

Measuring Semantic Similarity between Words Using Web Search Engines

Danushka Bollegala
The University of Tokyo
Hongo 7-3-1, Tokyo 113-8656, Japan
[email protected]

Yutaka Matsuo
National Institute of Advanced Industrial Science and Technology
Sotokanda 1-18-13, Tokyo, Japan
[email protected]

Mitsuru Ishizuka
The University of Tokyo
Hongo 7-3-1, Tokyo 113-8656, Japan
[email protected]

    ABSTRACT

Semantic similarity measures play important roles in information retrieval and Natural Language Processing. Previous work in semantic web-related applications such as community mining, relation extraction, and automatic metadata extraction has used various semantic similarity measures. Despite the usefulness of semantic similarity measures in these applications, robustly measuring semantic similarity between two words (or entities) remains a challenging task. We propose a robust semantic similarity measure that uses the information available on the Web to measure similarity between words or entities. The proposed method exploits page counts and text snippets returned by a Web search engine. We define various similarity scores for two given words P and Q, using the page counts for the queries P, Q, and P AND Q. Moreover, we propose a novel approach to compute semantic similarity using automatically extracted lexico-syntactic patterns from text snippets. These different similarity scores are integrated using support vector machines, to leverage a robust semantic similarity measure. Experimental results on the Miller-Charles benchmark dataset show that the proposed measure outperforms all the existing web-based semantic similarity measures by a wide margin, achieving a correlation coefficient of 0.834. Moreover, the proposed semantic similarity measure significantly improves the accuracy (F-measure of 0.78) in a community mining task and in an entity disambiguation task, thereby verifying the capability of the proposed measure to capture semantic similarity using web content.

    Categories and Subject Descriptors

H.3.3 [Information Systems]: Information Search and Retrieval

    General Terms

    Algorithms

    Keywords

    semantic similarity, Web mining

Copyright is held by the International World Wide Web Conference Committee (IW3C2). Distribution of these papers is limited to classroom use, and personal use by others.

WWW 2007, May 8-12, 2007, Banff, Alberta, Canada. ACM 978-1-59593-654-7/07/0005.

    1. INTRODUCTION

The study of semantic similarity between words has long been an integral part of information retrieval and natural language processing. Semantic similarity between entities changes over time and across domains. For example, apple is frequently associated with computers on the Web. However, this sense of apple is not listed in most general-purpose thesauri or dictionaries. A user who searches for apple on the Web may be interested in this sense of apple, and not apple as a fruit. New words are constantly being created, and new senses are assigned to existing words. Manually maintaining thesauri to capture these new words and senses is costly, if not impossible.

We propose an automatic method to measure semantic similarity between words or entities using Web search engines. Because of the vast number of documents and the high growth rate of the Web, it is difficult to analyze each document separately and directly. Web search engines provide an efficient interface to this vast information. Page counts and snippets are two useful information sources provided by most Web search engines. The page count of a query is the number of pages that contain the query words¹. The page count for the query P AND Q can be considered as a global measure of co-occurrence of words P and Q. For example, the page count of the query "apple" AND "computer" in Google² is 288,000,000, whereas the same for "banana" AND "computer" is only 3,590,000. The more than 80 times greater page count for "apple" AND "computer" indicates that apple is more semantically similar to computer than is banana.

Despite its simplicity, using page counts alone as a measure of co-occurrence of two words presents several drawbacks. First, page count analyses ignore the position of a word in a page. Therefore, even though two words appear in a page, they might not be related. Secondly, the page count of a polysemous word (a word with multiple senses) might contain a combination of all its senses. For example, page counts for apple contain page counts for apple as a fruit and apple as a company. Moreover, given the scale and noise in the Web, some words might occur arbitrarily, i.e., by random chance, on some pages. For those reasons, page counts alone are unreliable when measuring semantic similarity.

1 Page count may not necessarily be equal to the word frequency because the queried word might appear many times on one page.
2 http://www.google.com



Snippets, brief windows of text extracted by a search engine around the query term in a document, provide useful information regarding the local context of the query term. Semantic similarity measures defined over snippets have been used in query expansion [36], personal name disambiguation [4], and community mining [6]. Processing snippets is also efficient because it obviates the trouble of downloading web pages, which might be time consuming depending on the size of the pages. However, a widely acknowledged drawback of using snippets is that, because of the huge scale of the web and the large number of documents in the result set, only the snippets for the top-ranking results for a query can be processed efficiently. Ranking of search results, and hence of snippets, is determined by a complex combination of various factors unique to the underlying search engine. Therefore, no guarantee exists that all the information we need to measure semantic similarity between a given pair of words is contained in the top-ranking snippets.

This paper proposes a method that considers both page counts and lexico-syntactic patterns extracted from snippets, thereby overcoming the problems described above.

For example, let us consider the following snippet from Google for the query jaguar AND cat.

"The Jaguar is the largest cat in Western Hemisphere and can subdue larger prey than can the puma"

Here, the phrase is the largest indicates a hypernymic relationship between the Jaguar and the cat. Phrases such as also known as, is a, part of, and is an example of all indicate various semantic relations. Such indicative phrases have been applied to numerous tasks with good results, such as hyponym extraction [12] and fact extraction [27]. From the previous example, we form the pattern X is the largest Y, where we replace the two words Jaguar and cat by two wildcards X and Y.

Our contributions in this paper are twofold:

• We propose an approach based on automatically extracted lexico-syntactic patterns to compute semantic similarity using text snippets obtained from a Web search engine.

• We integrate different web-based similarity measures using WordNet synsets and support vector machines to create a robust semantic similarity measure. The integrated measure outperforms all existing Web-based semantic similarity measures on a benchmark dataset. To the best of our knowledge, this is the first attempt to combine both WordNet synsets and Web content to leverage a robust semantic similarity measure.
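The pattern-formation step described for the Jaguar/cat example, replacing the two query words in a snippet with the wildcards X and Y, can be sketched as below. The function name and the word-boundary handling are our own illustrative choices, not part of the paper's algorithm:

```python
import re

def form_pattern(snippet, word_p, word_q):
    """Replace the two query words in a snippet with wildcards X and Y.

    Returns the lexico-syntactic pattern, or None if either word is
    missing from the snippet.
    """
    p_re = re.compile(r"\b%s\b" % re.escape(word_p), re.IGNORECASE)
    q_re = re.compile(r"\b%s\b" % re.escape(word_q), re.IGNORECASE)
    if not p_re.search(snippet) or not q_re.search(snippet):
        return None
    return q_re.sub("Y", p_re.sub("X", snippet))

print(form_pattern("The jaguar is the largest cat", "jaguar", "cat"))
# The X is the largest Y
```

Patterns harvested this way from many snippets can then be ranked by how well they indicate semantic similarity.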

The remainder of the paper is organized as follows. In section 2 we discuss previous work related to semantic similarity measures. We then describe the proposed method in section 3. Section 4 compares the proposed method against previous Web-based semantic similarity measures and several baselines on a benchmark data set. In order to evaluate the ability of the proposed method to capture semantic similarity between real-world entities, we apply it in a community mining task. Finally, we show that the proposed method is useful for disambiguating senses in ambiguous named entities, and conclude the paper.


    2. RELATED WORK

Semantic similarity measures are important in many Web-related tasks. In query expansion [5, 25, 40] a user query is modified using synonymous words to improve the relevancy of the search. One method to find appropriate words to include in a query is to compare the previous user queries using semantic similarity measures. If there exists a previous query that is semantically related to the current query, then it can be suggested either to the user or used internally by the search engine to modify the original query.

Semantic similarity measures have been used in Semantic Web-related applications such as automatic annotation of Web pages [7], community mining [23, 19], and keyword extraction for inter-entity relation representation [26].

Semantic similarity measures are necessary for various applications in natural language processing such as word-sense disambiguation [32], language modeling [34], synonym extraction [16], and automatic thesauri extraction [8]. Manually compiled taxonomies such as WordNet³ and large text corpora have been used in previous work on semantic similarity [16, 31, 13, 17]. Regarding the Web as a live corpus has recently become an active research topic. Simple, unsupervised models demonstrably perform better when n-gram counts are obtained from the Web rather than from a large corpus [14, 15]. Resnik and Smith [33] extracted bilingual sentences from the Web to create a parallel corpus for machine translation. Turney [38] defined a point-wise mutual information (PMI-IR) measure using the number of hits returned by a Web search engine to recognize synonyms. Matsuo et al. [20] used a similar approach to measure the similarity between words and applied their method in a graph-based word clustering algorithm.

Given a taxonomy of concepts, a straightforward method to calculate similarity between two words (concepts) is to find the length of the shortest path connecting the two words in the taxonomy [30]. If a word is polysemous, then multiple paths might exist between the two words. In such cases, only the shortest path between any two senses of the words is considered for calculating similarity. A frequently acknowledged problem with this approach is that it relies on the notion that all links in the taxonomy represent a uniform distance.
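The edge-counting idea can be illustrated on a toy taxonomy. The concepts and is-a links below are invented for illustration; they are not the paper's data:

```python
from collections import defaultdict, deque

def shortest_path_length(edges, start, goal):
    """Length of the shortest path between two concepts, treating the
    is-a taxonomy as an undirected graph and searching with BFS."""
    graph = defaultdict(set)
    for child, parent in edges:
        graph[child].add(parent)
        graph[parent].add(child)
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        node, dist = queue.popleft()
        if node == goal:
            return dist
        for nxt in graph[node]:
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, dist + 1))
    return None  # the two concepts are not connected

# Toy (child, parent) is-a links, invented for illustration.
IS_A = [("car", "artifact"), ("automobile", "artifact"),
        ("artifact", "object"), ("living_thing", "object"),
        ("fruit", "living_thing"), ("apple", "fruit")]

print(shortest_path_length(IS_A, "car", "automobile"))  # 2
print(shortest_path_length(IS_A, "car", "apple"))       # 5
```

The shorter path between car and automobile (versus car and apple) is what the edge-counting measure turns into a higher similarity, and the uniform-distance assumption criticized above is visible here: every edge contributes exactly 1.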

Resnik [31] proposed a similarity measure using information content. He defined the similarity between two concepts C1 and C2 in the taxonomy as the maximum of the information content of all concepts C that subsume both C1 and C2. Then the similarity between two words is defined as the maximum of the similarity between any concepts that the words belong to. He used WordNet as the taxonomy; information content is calculated using the Brown corpus.

Li et al. [41] combined structural semantic information from a lexical taxonomy and information content from a corpus in a nonlinear model. They proposed a similarity measure that uses shortest path length, depth, and local density in a taxonomy. Their experiments reported a Pearson correlation coefficient of 0.8914 on the Miller and Charles [24] benchmark dataset. They did not evaluate their method in terms of similarities among named entities. Lin [17] defined the similarity between two concepts as the information that is in common to both concepts and the information contained in each individual concept.
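The information-content measures of Resnik and Lin can be sketched with a few lines of code. The concept probabilities below are invented for illustration, standing in for relative frequencies estimated from a corpus such as Brown:

```python
import math

# Hypothetical corpus probabilities p(c) for concepts in a small
# taxonomy (the numbers are invented for illustration).
p = {"entity": 1.0, "vehicle": 0.1, "car": 0.02, "automobile": 0.02}

def ic(concept):
    """Information content: IC(c) = -log p(c)."""
    return -math.log(p[concept])

def resnik_sim(lcs):
    """Resnik: IC of the most informative concept subsuming both words."""
    return ic(lcs)

def lin_sim(c1, c2, lcs):
    """Lin: shared information relative to each concept's own information."""
    return 2 * ic(lcs) / (ic(c1) + ic(c2))

# car and automobile share the subsumer "vehicle":
print(round(resnik_sim("vehicle"), 2))                    # 2.3
print(round(lin_sim("car", "automobile", "vehicle"), 2))  # 0.59
```

Note how Lin's measure normalizes the shared information by the information of the individual concepts, whereas Resnik's depends only on the subsumer.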

3 http://wordnet.princeton.edu/



Recently, some work has been carried out on measuring semantic similarity using Web content. Matsuo et al. [19] proposed the use of Web hits for extracting communities on the Web. They measured the association between two personal names using the overlap (Simpson) coefficient, which is calculated based on the number of Web hits for each individual name and their conjunction (i.e., the AND query of the two names).

Sahami et al. [36] measured semantic similarity between two queries using snippets returned for those queries by a search engine. For each query, they collect snippets from a search engine and represent each snippet as a TF-IDF-weighted term vector. Each vector is L2 normalized and the centroid of the set of vectors is computed. Semantic similarity between two queries is then defined as the inner product between the corresponding centroid vectors. They did not compare their similarity measure with taxonomy-based similarity measures.
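A minimal sketch of this centroid scheme, with hand-written "snippets" and an invented IDF table standing in for real search-engine output (the function names are ours, not Sahami et al.'s):

```python
import math
from collections import Counter

def centroid(snippets, idf):
    """TF-IDF-weight each snippet, L2-normalize it, then average the
    vectors into a single centroid vector for the query."""
    vecs = []
    for snippet in snippets:
        tf = Counter(snippet.lower().split())
        vec = {term: count * idf.get(term, 0.0) for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
        vecs.append({term: w / norm for term, w in vec.items()})
    terms = {term for vec in vecs for term in vec}
    return {term: sum(vec.get(term, 0.0) for vec in vecs) / len(vecs)
            for term in terms}

def snippet_similarity(snippets_a, snippets_b, idf):
    """Inner product of the two centroid vectors."""
    ca, cb = centroid(snippets_a, idf), centroid(snippets_b, idf)
    return sum(w * cb.get(term, 0.0) for term, w in ca.items())

# Invented IDF weights and snippets standing in for search-engine output.
idf = {"graphics": 2.0, "card": 1.0, "gpu": 2.0, "video": 1.5}
a = ["graphics card gpu", "video card"]
b = ["gpu graphics card"]
print(round(snippet_similarity(a, b, idf), 3))
```

Because each snippet vector is unit-length before averaging, the inner product of two identical single-snippet sets is 1, and sets with no shared (weighted) vocabulary score 0.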

Chen et al. [6] proposed a double-checking model using text snippets returned by a Web search engine to compute semantic similarity between words. For two words P and Q, they collect snippets for each word from a Web search engine. Then they count the occurrences of word P in the snippets for word Q and the occurrences of word Q in the snippets for word P. These values are combined nonlinearly to compute the similarity between P and Q. This method depends heavily on the search engine's ranking algorithm. Although two words P and Q might be very similar, there is no reason to believe that one can find Q in the snippets for P, or vice versa. This observation is confirmed by the experimental results in their paper, which report zero similarity scores for many pairs of words in the Miller and Charles [24] dataset.
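The double-checking counts can be sketched as follows. The geometric-mean combination here is an illustrative stand-in for Chen et al.'s actual nonlinear combination; what matters for the criticism above is that either direction counting zero forces the whole score to zero:

```python
import math

def double_check_score(snippets_p, snippets_q, p, q):
    """Count occurrences of q in the snippets for p, and of p in the
    snippets for q. The two counts are combined with a geometric mean
    (an illustrative stand-in for Chen et al.'s nonlinear combination);
    either count being zero forces a zero score."""
    f_q_in_p = sum(s.lower().split().count(q.lower()) for s in snippets_p)
    f_p_in_q = sum(s.lower().split().count(p.lower()) for s in snippets_q)
    if f_q_in_p == 0 or f_p_in_q == 0:
        return 0.0
    return math.sqrt(f_q_in_p * f_p_in_q)

snips_jaguar = ["the jaguar is the largest cat", "jaguar cat habitat"]
snips_cat = ["cats and the jaguar"]
print(double_check_score(snips_jaguar, snips_cat, "jaguar", "cat"))  # ~1.414
```

If the top-ranking snippets for P happen never to mention Q, the score collapses to zero regardless of how similar the words are, which is exactly the failure mode observed on the Miller and Charles pairs.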

    3. METHOD

    3.1 Outline

We propose a method which integrates both page counts and snippets to measure semantic similarity between a given pair of words. In section 3.2, we define four similarity scores using page counts. We then describe an automatic lexico-syntactic pattern extraction algorithm in section 3.3. We rank the patterns extracted by our algorithm according to their ability to express semantic similarity. We use two-class support vector machines (SVMs) to find the optimal combination of page-counts-based similarity scores and top-ranking patterns. The SVM is trained to classify synonymous word-pairs and non-synonymous word-pairs. We select synonymous word-pairs (positive training examples) from WordNet synsets⁴. Non-synonymous word-pairs (negative training examples) are automatically created using a random shuffling technique. We convert the output of the SVM into a posterior probability. We define the semantic similarity between two words as the posterior probability that they belong to the synonymous-words (positive) class.
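The last step of the outline, converting an SVM decision value into a posterior probability, is commonly done with a Platt-style sigmoid. The sketch below uses illustrative parameter values; the paper does not specify its calibration parameters here, and in practice a and b are fitted on held-out data:

```python
import math

def platt_posterior(svm_score, a=-1.0, b=0.0):
    """Map a raw SVM decision value to P(synonymous | features) with a
    sigmoid: 1 / (1 + exp(a * score + b)). With a < 0, larger decision
    values give probabilities closer to 1. The values of a and b are
    illustrative, not fitted."""
    return 1.0 / (1.0 + math.exp(a * svm_score + b))

# A word pair whose feature vector lands well on the synonymous side
# of the margin receives a posterior above 0.5:
print(platt_posterior(2.0) > 0.5)   # True
print(platt_posterior(-2.0) < 0.5)  # True
```

The resulting posterior, rather than the raw margin distance, serves as the final similarity score between the two words.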

    3.2 Page-count-based Similarity Scores

Page counts for the query P AND Q can be considered as an approximation of the co-occurrence of two words (or multi-word phrases) P and Q on the Web.

4 Informally, a synset is a set of synonymous words.

However, page counts for the query P AND Q alone do not accurately express semantic similarity. For example, Google returns 11,300,000 as the page count for "car" AND "automobile", whereas the same is 49,000,000 for "car" AND "apple". Although automobile is more semantically similar to car than apple is, page counts for the query "car" AND "apple" are more than four times greater than those for the query "car" AND "automobile". One must consider the page counts not just for the query P AND Q, but also for the individual words P and Q, to assess semantic similarity between P and Q.

We modify four popular co-occurrence measures, Jaccard, Overlap (Simpson), Dice, and PMI (point-wise mutual information), to compute semantic similarity using page counts.

For the remainder of this paper we use the notation H(P) to denote the page count for the query P in a search engine. The WebJaccard coefficient between words (or multi-word phrases) P and Q, WebJaccard(P, Q), is defined as,

    WebJaccard(P, Q) = 0,  if H(P ∩ Q) ≤ c;
    WebJaccard(P, Q) = H(P ∩ Q) / (H(P) + H(Q) − H(P ∩ Q)),  otherwise.   (1)

Therein, P ∩ Q denotes the conjunction query P AND Q. Given the scale and noise in Web data, it is possible that two words may appear on some pages purely accidentally. In order to reduce the adverse effects attributable to random co-occurrences, we set the WebJaccard coefficient to zero if the page count for the query P ∩ Q is less than a threshold c⁵.

Similarly, we define WebOverlap, WebOverlap(P, Q), as,

    WebOverlap(P, Q) = 0,  if H(P ∩ Q) ≤ c;
    WebOverlap(P, Q) = H(P ∩ Q) / min(H(P), H(Q)),  otherwise.   (2)

WebOverlap is a natural modification of the Overlap (Simpson) coefficient.

We define the WebDice coefficient as a variant of the Dice coefficient. WebDice(P, Q) is defined as,

    WebDice(P, Q) = 0,  if H(P ∩ Q) ≤ c;
    WebDice(P, Q) = 2 H(P ∩ Q) / (H(P) + H(Q)),  otherwise.   (3)

We define WebPMI as a variant form of PMI using page counts as,

    WebPMI(P, Q) = 0,  if H(P ∩ Q) ≤ c;
    WebPMI(P, Q) = log₂( (H(P ∩ Q)/N) / ((H(P)/N) (H(Q)/N)) ),  otherwise.   (4)

Here, N is the number of documents indexed by the search engine. Probabilities in Eq. 4 are estimated according to the maximum likelihood principle. To calculate PMI accurately using Eq. 4, we must know N, the number of documents indexed by the search engine. Although estimating the number of documents indexed by a search engine [2] is an interesting task in itself, it is beyond the scope of this work. In the present work, we set N = 10¹⁰ according to the number of indexed pages reported by Google.

5 We set c = 5 in our experiments.
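Equations 1-4 translate directly into code. The sketch below takes page counts as plain arguments (the hypothetical numbers in the usage lines stand in for real search-engine counts):

```python
import math

N = 1e10  # number of documents indexed by the search engine (Section 3.2)
C = 5     # co-occurrence threshold c (footnote 5)

def web_jaccard(h_p, h_q, h_pq, c=C):
    """Eq. 1: page-count variant of the Jaccard coefficient."""
    return 0.0 if h_pq <= c else h_pq / (h_p + h_q - h_pq)

def web_overlap(h_p, h_q, h_pq, c=C):
    """Eq. 2: page-count variant of the Overlap (Simpson) coefficient."""
    return 0.0 if h_pq <= c else h_pq / min(h_p, h_q)

def web_dice(h_p, h_q, h_pq, c=C):
    """Eq. 3: page-count variant of the Dice coefficient."""
    return 0.0 if h_pq <= c else 2 * h_pq / (h_p + h_q)

def web_pmi(h_p, h_q, h_pq, c=C, n=N):
    """Eq. 4: page-count variant of point-wise mutual information."""
    if h_pq <= c:
        return 0.0
    return math.log2((h_pq / n) / ((h_p / n) * (h_q / n)))

# Hypothetical page counts: H(P) = 1000, H(Q) = 2000, H(P AND Q) = 500.
print(web_jaccard(1000, 2000, 500))  # 0.2
print(web_overlap(1000, 2000, 500))  # 0.5
print(web_jaccard(1000, 2000, 3))    # 0.0 (below the threshold c)
```

The threshold check in each function implements the zeroing rule that guards against accidental co-occurrences on noisy Web pages.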
