
    Let Sense Bags Do Talking: Cross Lingual Word Semantic Similarity for English and Hindi

    Apurva Nagvenkar
    DCST, Goa University
    apurv.nagvenkar@gmail.com

    Jyoti Pawar
    DCST, Goa University
    jyotidpawar@gmail.com

    Pushpak Bhattacharyya
    CSE, IIT Bombay
    pb@cse.iitb.ac.in

    Abstract

    Cross Lingual Word Semantic (CLWS) similarity is defined as the task of finding the semantic similarity between two words across languages. Semantic similarity has been widely used to compute the similarity between two words in the same language. CLWS similarity should prove effective in areas such as Cross Lingual Information Retrieval, Machine Translation, and Cross Lingual Word Sense Disambiguation.

    In this paper, we discuss a system developed to compute the CLWS similarity of words between two languages, where one language is treated as resourceful and the other as resource scarce. The system is developed using WordNet. The intuition behind this system is that two words are semantically similar if their senses are similar to each other. The system is tested for English and Hindi and achieves 60.5% precision@1 and 72.91% precision@3.

    1 Introduction

    Word semantic similarity between two words is represented by the similarity between the concepts associated with them. It plays a vital role in Natural Language Processing (NLP) and Information Retrieval (IR). In NLP (Sinha and Mihalcea, 2007), it is widely used in Word Sense Disambiguation, Question Answering, Machine Translation (MT), etc. In IR (Hliaoutakis et al., 2006), it can be used in Image Retrieval, Multimodal Document Retrieval, Query Expansion, etc.

    The goal of CLWS similarity is to measure the semantic similarity between two words across languages. In this paper, we propose a system that computes CLWS similarity between two languages, Ls and Lt, where Ls is treated as the resourceful language and Lt as the resource-scarce language. Given a word in Ls and a word in Lt, the CLWS similarity engine analyzes which two concepts of the words are similar. Let m and n be the number of concepts for the word in Ls and Lt respectively. The output is then a sorted list of all possible combinations of concepts (i.e. a list of m × n pairs) ordered by similarity score, and the topmost combination is the pair of concepts from Ls and Lt judged most similar. The system is developed and tested for English and Hindi, where English is Ls and Hindi is Lt.
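The ranking step described above can be sketched in a few lines of Python. This is only an illustration: the sense identifiers (`en_1`, `hi_1`, ...) and the scores in `TOY_SCORES` are hypothetical, and `score` stands in for whichever CLWS measure is plugged in.

```python
from itertools import product


def rank_sense_pairs(source_senses, target_senses, score):
    """Score all m x n sense pairs and return them sorted, best first."""
    scored = [(score(s, t), s, t) for s, t in product(source_senses, target_senses)]
    return sorted(scored, key=lambda item: item[0], reverse=True)


# Hypothetical scores for a word with m = 2 English and n = 2 Hindi senses.
TOY_SCORES = {
    ("en_1", "hi_1"): 0.82,
    ("en_1", "hi_2"): 0.10,
    ("en_2", "hi_1"): 0.25,
    ("en_2", "hi_2"): 0.40,
}

ranked = rank_sense_pairs(
    ["en_1", "en_2"], ["hi_1", "hi_2"], lambda s, t: TOY_SCORES[(s, t)]
)
best_score, best_source, best_target = ranked[0]
```

The first entry of `ranked` is the pair a precision@1 evaluation would look at, and the first three entries are what precision@3 would consider.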

    1.1 Semantic Similarity

    Considerable research effort has been devoted to designing semantic similarity measures for the monolingual setting. WordNet has been widely adopted in semantic similarity measures for English due to its large hierarchical organization of synsets. Monolingual semantic similarity can be computed using Edge Counting or Information Content (IC). Edge counting is a path-based approach that makes use of knowledge bases. It measures similarity by computing the shortest distance between two concepts, where the distance is measured along the IS-A hierarchy. Path-based measures include the Path Length Measure, the Leacock and Chodorow Measure (Leacock and Chodorow, 1998), and the Wu & Palmer Measure (Wu and Palmer, 1994). IC is a probabilistic approach that makes use of a corpus. It computes the negative logarithm of the probability of occurrence of the concept in a large corpus. IC approaches include the Resnik Measure (Resnik, 1995), the Jiang Conrath Measure (Jiang and Conrath, 1997), the Lin Measure (Lin, 1998), etc.
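To make the two families of measures concrete, the following sketch computes a path-length similarity, the Wu & Palmer score, and an information-content value. The `PARENT` hierarchy is a hypothetical toy taxonomy, not WordNet, and the path similarity uses one common formulation, 1 / (1 + path length), which is an assumption rather than something the paper specifies.

```python
import math

# Hypothetical toy IS-A hierarchy (child -> parent); not taken from WordNet.
PARENT = {
    "dog": "canine",
    "cat": "feline",
    "canine": "mammal",
    "feline": "mammal",
    "mammal": "animal",
    "animal": None,
}


def ancestors(concept):
    """Chain from the concept up to the root, concept first."""
    chain = []
    while concept is not None:
        chain.append(concept)
        concept = PARENT[concept]
    return chain


def depth(concept):
    """Number of nodes on the path to the root (the root has depth 1)."""
    return len(ancestors(concept))


def lowest_common_subsumer(a, b):
    """Deepest concept that is an ancestor of both a and b."""
    anc_b = set(ancestors(b))
    for concept in ancestors(a):
        if concept in anc_b:
            return concept
    return None


def path_length(a, b):
    """Number of IS-A edges on the shortest path between a and b."""
    lcs = lowest_common_subsumer(a, b)
    return ancestors(a).index(lcs) + ancestors(b).index(lcs)


def path_similarity(a, b):
    """Edge counting: shorter paths give higher similarity."""
    return 1.0 / (1 + path_length(a, b))


def wu_palmer(a, b):
    """Wu & Palmer: 2 * depth(lcs) / (depth(a) + depth(b))."""
    lcs = lowest_common_subsumer(a, b)
    return 2.0 * depth(lcs) / (depth(a) + depth(b))


def information_content(probability):
    """IC of a concept: negative log of its corpus probability."""
    return -math.log(probability)
```

In this toy hierarchy, `dog` and `cat` meet at `mammal`, four edges apart, so the path similarity is 0.2 and the Wu & Palmer score is 0.5.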

    2 The Proposed Idea

    The aim of our work is to design a CLWS similarity engine that computes the similarity between two words across languages. Given a word in English and a word in Hindi, the system analyzes which two synsets of the words are similar by making use of WordNet.

    To obtain CLWS similarity we follow these steps.

    1. Given words $W_{EN}$ and $W_{HN}$, their senses must be present in their respective WordNets, i.e. the English and Hindi WordNets.

    2. Let $S_{EN} = \{s_{EN}^{1}, s_{EN}^{2}, \ldots, s_{EN}^{m}\}$ and $S_{HN} = \{s_{HN}^{1}, s_{HN}^{2}, \ldots, s_{HN}^{n}\}$ be the sets of sense bags. Each sense bag is obtained from the constituents of its synset, i.e. content words from the concept, examples, synonyms and hypernyms (depth = 1). We make sure that the words in a sense bag are present in the WordNet.

    3. The Hindi sense bags $S_{HN}$ must be translated to the resourceful language, i.e. English. The translations are obtained using a bilingual dictionary or the GIZA++ aligner (Och and Ney, 2003), which identifies word alignments in a parallel corpus.
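Steps 2 and 3 can be sketched as follows. This is a minimal illustration under stated assumptions: the synset record layout (`gloss`, `examples`, `synonyms`, `hypernyms`), the stopword filter used as a stand-in for content-word selection, and the romanized toy dictionary are all hypothetical and not taken from the paper.

```python
def build_sense_bag(synset, vocabulary, stopwords):
    """Collect content words from gloss, examples, synonyms and depth-1 hypernyms."""
    tokens = synset["gloss"].lower().split()
    for example in synset["examples"]:
        tokens += example.lower().split()
    tokens += synset["synonyms"]
    tokens += synset["hypernyms"]  # direct hypernyms only (depth = 1)

    bag = []
    for token in tokens:
        token = token.strip(".,;:!?\"'()")
        # Keep only words that exist in the WordNet vocabulary, skipping
        # function words (an assumed proxy for "content words") and duplicates.
        if token in vocabulary and token not in stopwords and token not in bag:
            bag.append(token)
    return bag


def translate_bag(bag, dictionary):
    """Replace each target-language word by its list of candidate translations."""
    return [dictionary[word] for word in bag if word in dictionary]


# Hypothetical English synset record.
synset = {
    "gloss": "a financial institution that accepts deposits",
    "examples": ["he cashed a check at the bank"],
    "synonyms": ["bank", "depository"],
    "hypernyms": ["institution"],
}
vocabulary = {"financial", "institution", "deposits", "check", "bank", "depository", "the", "a"}
stopwords = {"a", "the", "that", "he", "at"}
english_bag = build_sense_bag(synset, vocabulary, stopwords)

# Hypothetical bilingual dictionary (romanized Hindi -> English candidates).
toy_dictionary = {
    "paisa": ["money", "cash"],
    "sanstha": ["institution", "organization"],
}
translated = translate_bag(["paisa", "sanstha", "xyz"], toy_dictionary)
```

Keeping every candidate translation, rather than picking one, leaves the choice to the similarity measures, which take the maximum over translations.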

    We say that two words are semantically similar if

    1. $W_{EN}$, compared with $s_{EN}^{i}$ and with $s_{HN}^{j}$, yields similar scores, i.e. $\mathrm{CLWS}_{w}(W_{EN}, s_{EN}^{i}) \approx \mathrm{CLWS}_{wtr}(W_{EN}, s_{HN}^{j})$. This is further explained in Section 3.1.

    2. They have similar sense bags, i.e. $\mathrm{CLWS}_{s}(s_{EN}^{i}, s_{HN}^{j}) \approx 1$ and $\mathrm{CLWS}_{sweight}(w_{EN}, s_{EN}^{i}, s_{HN}^{j}) \approx 1$. This is further explained in Sections 3.2.1 and 3.2.2.

    3 CLWS Similarity Measures

    The following measures are used to compute the CLWS similarity score.

    3.1 Measure 1: Word to Sense Bag Similarity Measure

    In this measure, the source word $W_{EN}$ is compared with the words present in a sense bag using a monolingual similarity function, equation (1). The parameters $w_1$ and $w_2$ are words from the source (English) language, and the parameter $\alpha$ is the approach, i.e. either edge counting or information content.

    $$\mathrm{sim}(w_1, w_2, \alpha) \tag{1}$$

    3.1.1 Source Word to Source Sense Bag Similarity Measure

    The source word $W_{EN}$ is compared with the words present in the source sense bag $s_{EN}^{i} = (ew_1, ew_2, \ldots, ew_p)$. $W_{EN}$ is first compared with the first word $ew_1$ of $s_{EN}^{i}$ using the monolingual similarity function, i.e. $\mathrm{sim}(W_{EN}, ew_1, \alpha)$. The same procedure is applied to the other words of the source sense bag, and we obtain the following measure.

    $$\mathrm{CLWS}_{w}(W_{EN}, s_{EN}^{i}) = \frac{1}{p} \sum_{k=1}^{p} \mathrm{sim}(W_{EN}, ew_k, \alpha) \tag{2}$$
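Equation (2) is an average of word-to-word similarities, which the sketch below illustrates. The `TOY_SIM` table is a hypothetical stand-in for a real monolingual measure, and the choice of approach is modelled by passing a different `sim_fn`. Returning 0.0 for an empty bag is an assumption.

```python
# Hypothetical word-to-word similarities standing in for a WordNet measure.
TOY_SIM = {
    frozenset(("bank", "money")): 0.6,
    frozenset(("bank", "institution")): 0.8,
    frozenset(("bank", "river")): 0.2,
}


def sim(w1, w2):
    """Toy symmetric similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIM.get(frozenset((w1, w2)), 0.0)


def clws_w(source_word, source_bag, sim_fn=sim):
    """Equation (2): mean similarity between the source word and its sense bag."""
    if not source_bag:
        return 0.0  # assumed convention for an empty bag
    return sum(sim_fn(source_word, ew) for ew in source_bag) / len(source_bag)


value = clws_w("bank", ["money", "institution", "bank"])
```

With these toy values the three similarities are 0.6, 0.8 and 1.0, so the measure comes out at 0.8.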

    3.1.2 Source Word to Target Sense Bag Similarity Measure

    Here, the source word $W_{EN}$ is compared with the words present in the target (Hindi) sense bag $s_{HN}^{j} = (hw_1, hw_2, \ldots, hw_q)$. $W_{EN}$ is first compared with the first word $hw_1$ of $s_{HN}^{j}$. In this case, $hw_1$ is replaced with its translation. If a word has more than one translation, the maximum score over the translations is taken as the winning candidate (equation (3)). The same procedure is applied to the other words of the target sense bag, and we obtain equation (4) for this measure.

    $$\mathrm{sim}_{tr}(W_{EN}, hw_j, \alpha) = \max_{k=1}^{l} \mathrm{sim}(W_{EN}, ew_k^{tr}, \alpha), \quad \text{where } hw_j = (ew_1^{tr}, ew_2^{tr}, \ldots, ew_l^{tr}) \tag{3}$$

    $$\mathrm{CLWS}_{wtr}(W_{EN}, s_{HN}^{j}) = \frac{1}{q} \sum_{k=1}^{q} \mathrm{sim}_{tr}(W_{EN}, hw_k, \alpha) \tag{4}$$

    We say that two (source and target) sense bags are similar to each other if they are related to the source word.

    $$\mathrm{score}_1 = 1 - \left| \mathrm{CLWS}_{w}(W_{EN}, s_{EN}^{i}) - \mathrm{CLWS}_{wtr}(W_{EN}, s_{HN}^{j}) \right| \tag{5}$$
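Equations (2) to (5) fit together as in the sketch below. The similarity table is hypothetical, and each Hindi word is assumed to arrive already replaced by its list of English translation candidates. Returning 0.0 for empty inputs is also an assumption.

```python
# Hypothetical word-to-word similarities standing in for a WordNet measure.
TOY_SIM = {
    frozenset(("bank", "money")): 0.6,
    frozenset(("bank", "institution")): 0.8,
}


def sim(w1, w2):
    """Toy symmetric similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIM.get(frozenset((w1, w2)), 0.0)


def sim_tr(source_word, translations, sim_fn=sim):
    """Equation (3): best score over all translations of one target word."""
    return max((sim_fn(source_word, ew) for ew in translations), default=0.0)


def clws_w(source_word, source_bag, sim_fn=sim):
    """Equation (2): mean similarity to the source sense bag."""
    if not source_bag:
        return 0.0
    return sum(sim_fn(source_word, ew) for ew in source_bag) / len(source_bag)


def clws_wtr(source_word, target_bag, sim_fn=sim):
    """Equation (4): mean of sim_tr over the translated target sense bag."""
    if not target_bag:
        return 0.0
    return sum(sim_tr(source_word, tr, sim_fn) for tr in target_bag) / len(target_bag)


def score1(source_word, source_bag, target_bag, sim_fn=sim):
    """Equation (5): 1 minus the absolute difference of the two bag scores."""
    return 1.0 - abs(
        clws_w(source_word, source_bag, sim_fn)
        - clws_wtr(source_word, target_bag, sim_fn)
    )


source_bag = ["money", "institution", "bank"]
# Each inner list holds the translation candidates of one Hindi word.
target_bag = [["money", "cash"], ["institution", "organization"]]
result = score1("bank", source_bag, target_bag)
```

Here the source bag scores 0.8 and the translated target bag scores 0.7, so score1 is 0.9. Because both inputs lie in [0, 1], score1 does too, and it reaches 1 only when the two bags are equally related to the source word.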

    3.2 Sense Bag to Sense Bag SimilarityMeasure

    In this measure, the source sense bag sENi is com-pared with target sense bag sHNj .

    3.2.1 Measure 2: Sense Bag to Sense Bag Similarity without weight

    In this measure, every word from $s_{EN}^{i}$ is compared with all the words from $s_{HN}^{j}$. For example, the first word $ew_1$ is compared with all the words from $s_{HN}^{j}$ using equation (3), and the maximum score is retained. The same process is repeated for the remaining words in the source sense bag to obtain equation (6).

    $$\mathrm{CLWS}_{s}(s_{EN}^{i}, s_{HN}^{j}) = \frac{1}{p} \sum_{k=1}^{p} \left( \max_{l=1}^{q} \mathrm{sim}_{tr}(ew_k, hw_l, \alpha) \right) \tag{6}$$

    $$\mathrm{score}_2 = \mathrm{CLWS}_{s}(s_{EN}^{i}, s_{HN}^{j}) \tag{7}$$
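A sketch of equation (6) follows: for each source word, take its best match in the translated target bag, then average. The similarity values are hypothetical, and target words are again assumed to be given as lists of English translation candidates.

```python
# Hypothetical word-to-word similarities standing in for a WordNet measure.
TOY_SIM = {
    frozenset(("deposit", "cash")): 0.5,
}


def sim(w1, w2):
    """Toy symmetric similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIM.get(frozenset((w1, w2)), 0.0)


def sim_tr(word, translations, sim_fn=sim):
    """Equation (3): best score over all translations of one target word."""
    return max((sim_fn(word, ew) for ew in translations), default=0.0)


def clws_s(source_bag, target_bag, sim_fn=sim):
    """Equation (6): average, over source words, of the best target match."""
    if not source_bag or not target_bag:
        return 0.0  # assumed convention for empty bags
    total = 0.0
    for ew in source_bag:
        total += max(sim_tr(ew, tr, sim_fn) for tr in target_bag)
    return total / len(source_bag)


source_bag = ["money", "deposit"]
target_bag = [["money", "cash"], ["institution", "organization"]]
score2 = clws_s(source_bag, target_bag)
```

In this toy case `money` matches a translation exactly (1.0) and `deposit` is closest to `cash` (0.5), giving score2 = 0.75. Note that the measure is asymmetric: it averages over the source bag only.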

    3.2.2 Measure 3: Sense Bag to Sense Bag Similarity with weight

    This is similar to the above measure (Section 3.2.1), but at every iteration weights are assigned to the words being compared. Every word from both sense bags is assigned a weight obtained by comparing it with the source word $W_{EN}$ using equations (1) and (3). This weight reflects how closely the words in the sense bags are related to $W_{EN}$.

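    $$\mathrm{CLWS}_{sweight}(w_{EN}, s_{EN}^{i}, s_{HN}^{j}) = \frac{1}{p} \sum_{k=1}^{p} \mathrm{sim}(w_{EN}, ew_k, \alpha) \cdot \max_{l=1}^{q} \left( \mathrm{sim}_{tr}(w_{EN}, hw_l, \alpha) \cdot \mathrm{sim}_{tr}(ew_k, hw_l, \alpha) \right) \tag{8}$$

    $$\mathrm{score}_3 = \mathrm{CLWS}_{sweight}(w_{EN}, s_{EN}^{i}, s_{HN}^{j}) \tag{9}$$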
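The weighted measure can be sketched as below, assuming the weights combine multiplicatively with the word-pair similarity, which is how equation (8) is read here. The similarity table and the bags are hypothetical toy data.

```python
# Hypothetical word-to-word similarities standing in for a WordNet measure.
TOY_SIM = {
    frozenset(("bank", "money")): 0.6,
    frozenset(("bank", "institution")): 0.8,
}


def sim(w1, w2):
    """Toy symmetric similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIM.get(frozenset((w1, w2)), 0.0)


def sim_tr(word, translations, sim_fn=sim):
    """Equation (3): best score over all translations of one target word."""
    return max((sim_fn(word, ew) for ew in translations), default=0.0)


def clws_s_weight(source_word, source_bag, target_bag, sim_fn=sim):
    """Equation (8), assuming multiplicative weights."""
    if not source_bag or not target_bag:
        return 0.0  # assumed convention for empty bags
    total = 0.0
    for ew in source_bag:
        source_weight = sim_fn(source_word, ew)  # how related ew is to the word
        best = max(
            # weight of the target word times the word-pair similarity
            sim_tr(source_word, tr, sim_fn) * sim_tr(ew, tr, sim_fn)
            for tr in target_bag
        )
        total += source_weight * best
    return total / len(source_bag)


source_bag = ["money", "institution"]
target_bag = [["money", "cash"], ["institution", "organization"]]
score3 = clws_s_weight("bank", source_bag, target_bag)
```

Both source words find an exact match in the target bag, yet score3 is only 0.5 (0.6 × 0.6 and 0.8 × 0.8, averaged), because each match is discounted by how weakly the matched words relate to `bank` in the toy table.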

    3.3 Measure 4: Incorporating Monolingual Corpus

    To check the frequency of a sense, we compare the sense bag with a large monolingual corpus. A context bag ($CB_{EN}$) is obtained for a word $W_{EN}$ using the word2vec¹ toolkit (Mikolov et al., 2013). $CB_{EN}$ is compared with all the sense bags of $S_{EN}$ (using Sense Bag to Sense Bag Similarity with weight, since it gives higher accuracy than the others; refer to Section 5.3). This method assigns a similarity score to every sense bag. A sense bag with more frequent usage is assigned a higher value than a sense bag with less frequent usage. This score is further multiplied with the similarity score obtained from $s_{EN}^{i}$ and $s_{HN}^{j}$.

    ¹ https://code.google.com/p/word2vec/

    $$\mathrm{score}_4 = \mathrm{CLWS}_{sweight}(w_{EN}, s_{EN}^{i}, CB_{EN}) \times \mathrm{CLWS}_{sweight}(w_{EN}, s_{EN}^{i}, s_{HN}^{j}) \tag{10}$$
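Equation (10) can be sketched as the product of two weighted bag scores. Several things here are assumptions: the context bag is taken to be a plain list of English words (how it is derived from word2vec is not shown), each context word is treated as its own single "translation" so the same function can be reused, and all similarity values are hypothetical.

```python
# Hypothetical word-to-word similarities standing in for a WordNet measure.
TOY_SIM = {
    frozenset(("bank", "money")): 0.6,
    frozenset(("bank", "institution")): 0.8,
    frozenset(("bank", "river")): 0.2,
}


def sim(w1, w2):
    """Toy symmetric similarity in [0, 1]; identical words score 1.0."""
    if w1 == w2:
        return 1.0
    return TOY_SIM.get(frozenset((w1, w2)), 0.0)


def sim_tr(word, translations):
    """Equation (3): best score over all translations of one target word."""
    return max((sim(word, ew) for ew in translations), default=0.0)


def clws_s_weight(source_word, source_bag, target_bag):
    """Equation (8), assuming multiplicative weights."""
    if not source_bag or not target_bag:
        return 0.0
    total = 0.0
    for ew in source_bag:
        best = max(sim_tr(source_word, tr) * sim_tr(ew, tr) for tr in target_bag)
        total += sim(source_word, ew) * best
    return total / len(source_bag)


def score4(source_word, source_bag, context_bag, target_bag):
    """Equation (10): usage-frequency term times cross-lingual term."""
    # English context words need no translation: wrap each in a one-item list.
    context_as_bag = [[word] for word in context_bag]
    frequency_term = clws_s_weight(source_word, source_bag, context_as_bag)
    cross_lingual_term = clws_s_weight(source_word, source_bag, target_bag)
    return frequency_term * cross_lingual_term


context_bag = ["money", "cash"]  # hypothetical context words for "bank"
target_bag = [["money", "cash"], ["institution", "organization"]]
financial = score4("bank", ["money", "institution"], context_bag, target_bag)
river = score4("bank", ["river", "water"], context_bag, target_bag)
```

With a context bag dominated by financial words, the financial sense bag keeps a positive score while the river sense bag drops to zero, which is the intended effect of favouring the more frequently used sense.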

    4 Resources and Dataset Used

    Princeton WordNet (Miller et al., 1990) has 117791 synsets and 147478 words, whereas Hindi WordNet (Bhattacharyya, 2010) has 38782 synsets and 99435 words in its inventory. These WordNets are used to derive sense bags for each language. We use publicly available pre-trained word embeddings for English, trained on the Google News dataset² (about 100 billion words). These embeddings are available for around 3 million words and phrases. We also use the ILCI parallel corpus³ to obtain word alignments of Hindi with respect to English, for translating Hindi words to English. The ILCI corpus contains 50,000 parallel sentences.

    To check the performance of our system we need t