15
FIRE 2013 By:- Hardik Joshi 1 , Apurva Bhatt 1 , Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat University, Ahmedabad, India. 2 L.J. College of Engineering, Ahmedabad, India Dec Presentation on : Transliterated Search using Syllabification Approach @FIRE 4rth Dec 2013

FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Embed Size (px)

Citation preview

Page 1: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

FIRE 2013

By:-Hardik Joshi1, Apurva Bhatt1,Honey Patel2

{hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com1Department of Computer Science, Gujarat University, Ahmedabad,

India.2L.J. College of Engineering, Ahmedabad, India

 Dec

 

Presentation on  :

Transliterated Search using Syllabification Approach

@FIRE 4rth Dec 2013

Page 2: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Introduction Our Approach Syllabification Our Results Error And Analysis Conclusion

Content

Page 3: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

There is need to provide local language support in web based applications because various domains such as ecommerce sites require English knowledge.

The challenge in transliteration is take the word “रा�ष्ट्रपति�” for this word “rashtrapati”, “rashtrapathi”, “raashtrapathy”, “raashtrpati” are various possible combinations may possible which one should be correct is again an issue.

Transliteration tasks become difficult in presence of out of vocabulary words (OOV) and noisy words.

Introduction

Page 4: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

In both the subtasks, the transliteration was performed using syllabification approach.

In the subtask-1, we had done the morphological analysis of English words , then a corpus based approach used to identify frequently occurring Hindi words.

In the subtask-2, the queries were formulated that contained both Roman and Devanagari script and Roman script for separate run submissions.

Page 5: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Linguists have different languages have constraints on possible consonant and vowel sequences that characterize not only the word structure for the language but also the syllable structure.

Vowels @ center (nucleus) consonant @ beginning (onset) End is coda

Syllabification Approach

sy

llabl

e

Onset

nucleus

coda

Rhyme

Page 6: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Word

Sprint

Syllable Structure Example

Page 7: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat
Page 8: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Source Target s u d a k a r स ◌ु� द ◌ु� क रा c h h a g a n छ ग ण j i t e s h ज �◌ु◌ु � ◌ु� श n a r a y a n न ◌ु� रा ◌ु� य ण s h i v श �◌ु◌ु व m a d h a v म ◌ु� ध व m o h a m m a d म ◌ु� ह म ◌ु� म द

Training Format

Page 9: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Algorithm for subtask-I

Step 1: First of all words are fetching in English dictionary. Step 2: perform spell-check ,stemming and also morphological

analysis for English language, if no spell error and match found then label the word as English =E.

Step 3: If English word are not found then check with English corpus of US News paper.

Step 4: If English word found then check with English corpus of Indian news paper.

Step 5: If English word found in US News paper and not found in Indian news paper then word=E.

Page 10: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Step 6: Step 2 and step 5 are parallel apply for English words and label as =\E.

Step 7: Remaining words would be transliterate into Hindi words and Label the word as = \H.

Step 8: Apply to Moses tool ,which one is help English words transliterate into Hindi words.

Page 11: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

RESULT OF SUBTASK-1

Page 12: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Results For Subtask 2

Run 1 “म�रा� स�पन�न तिक रा�न� क�ब्� आय�ग� �� mere sapnon ki rani kab aayegi tu”.

Run 2 “mere sapnon ki rani kab aayegi tu”.

Metrics Run-1 Run-2 Maximum Score

Median Score

nDCG@5 0.5627 0.5262 0.8052 0.5620

nDCG@10 0.5619 0.5232 0.8002 0.5608

MAP 0.2546 0.2163 0.4236 0.2355

MRR 0.5835 0.5730 0.8440 0.5884

Page 13: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

There are some problems in the transliteration which decreased the precision.

Error in the maatra : “sapnon” => “स�पन�न”, “ki” => “क ”, “kab” => “क�ब्”, “main” => “मिमन” & “mein” => “म�न” , na => न & ka => क

Multiple Mapping of the words e.g. T = �, ट, i.e. tera=>ट�रा�, tum => �#म, to => ट�, teri =>ट�रिरा .

Missing sounds (फ, ख, छ ‘chh’, ksh) i. e. for word “accha” we got “आक्क�”, for , “poochho” we got “प#छ�ट”.

Error And Analysis

Page 14: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

Multiple Transliterations- c,k The vowel are not giving perfect answers i.e. “lo” => “लॉ*” , “ho”=> “ह�रा”, “ko” => “क*” Spelling Variations(shree,shri) Conjuncts formation(“kya” => “क� य�”) Missing of vowels ‘ak tr khan’ (अक ◌ु� �रा ख�न) ‘y’ As Vowel: ‘anthony’ & ‘Shyam’

Page 15: FIRE 2013 By:- Hardik Joshi 1, Apurva Bhatt 1, Honey Patel 2 {hardikjjoshi,apurva.bhatt7,Honeypatel.39}@gmail.com 1 Department of Computer Science, Gujarat

We used the syllabification approach and considered the most probable term in the transliteration process. The word labeling task was performed assuming that a term either belongs to English language or Hindi language. We were able to get high accuracy in English recall as the labeling approach used morphological analysis and dictionary approach. However due to syllabification model, the transliteration did not give high precision resulting in lower precision of transliteration tasks and subsequently lower precision metrics in the song lyrics retrieval tasks.

Conclusion