Transliteration Transliteration CS 626 course seminar by Purva Joshi 08305907 Mugdha Bapat 07305916...

Preview:

Citation preview

TransliterationTransliteration

CS 626 CS 626 course seminar bycourse seminar by

Purva Joshi 08305907 Mugdha Bapat 07305916

Aditya Joshi 08305908 Manasi Bapat 08305906

Humans transliterate frequently for different reasons

Can a machine do this?

(Why would a machine have to do this?)

If yes, how?Picture courtesy: Snapshot of Yahoo! Messenger

Motivation

• An important component of machine translation

• When you cannot translate, transliterate• Generally used for named entities, technical

terms and out of vocabulary words (OOV)• Issues specific to sounds, scripts and

accents• Can a machine do this? If yes, how?

• Task of converting a word from one alphabetic script to another

Used for:

• Named entities

• : Gandhiji

• Out of vocabulary words

• : Bank

What is transliteration?

• Accents

: Thoda or thora?

• Mapping of sounds

• Mahaan: Kahaan:

• Back-transliteration

Linguistic issues

Arabic Chinese Hindi/ Japanese

• Arabic b -> English p or bEnglish word: Paul transliterates toArabic word: Baul (issue in Back-transliteration)

• Origin of the proper noun determines the symbol in Chinese language

• Ideographic symbols in Chinese

• Several English symbols do not mapto any Japanese symbols. So, oftenmapped to closest sounding symbolice cream aisukuriimu

Linguistic Issues : Mapping of sounds

• Symbols map to different symbolsbased on their position

America

• Difference in originRestaurantconstant

xOverview

Source String

TransliterationUnits

Target String

TransliterationUnits

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Phoneme-based approach

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

P( ps | ws)

P ( pt | ps )

P ( wt | pt )

Note: Phoneme is the smallest linguistically distinctive unit of sound.

P(wt)

Wt* = argmax (P (wt). P (wt | pt) . P (pt | ps) . P (ps | ws) )

Phoneme-based approach

Step I :

Consider each character of the word

Transliterating ‘BAPAT’B A P A T

P /ə//ə/ /a://a://ə//ə/ /a://a:/B T

Source word

to phonemes

P /ə//ə/ /a://a://ə//ə/ /a://a:/BT

Source phonemes

to target phonemes

t

t

Step II : Converting to phoneme seq.Step III : Converting to target phoneme seq.

Phoneme-based approach

Step IV : Phoneme sequence to target stringB : /ə/ :/ə/ :

/a:/ :/a:/ :

P:/ə/ :/ə/ :

/a:/ :/a:/ :

T:

t:

Output :

Concerns

Word inSource language

Pronunciationin

Source language

Word inTarget language

PronunciationIn

target language

Check if the world is validIn target language

Check if environment Is noise-free

• Unknown pronunciations

• Back-transliteration can be a problem Johnson Jonson

Issues in phonetic model

sanhita

samhita

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Spelling-based

• Maps source word sequences to target word sequences (i.e. direct word to word)

• The transliteration score:

• P(w)

Spelling-based model

Letter trigram model included

Thus, we can accommodate the words not included in the dictionary

Pronunciationin

Source language

PronunciationIn

target language

Word inSource language

Word inTarget language

Comparison of the two methods

Contents

Source String

TransliterationUnits

Target String

TransliterationUnits

Phoneme- based

Spelling-based

Joint Source

Channel

• Particularly developed for Chinese

• Chinese : Highly ideographic

• Example :

• Two main steps:

The Third Method - Why?

Image courtesy: wikimedia-commons

Modeling Decoding

Modeling Step

• A bilingual dictionary in the source and target language

• From this dictionary, the character mapping between the source and target language is learnt

The word “Geo” has two possible mappings, the “context” in which it occurs is important

John

Georgia

Geology

Geo

Geo

Modeling step

Modeling step …

• N-gram Mapping :

• < Geo, > < rge, >

• < Geo, > < lo, >

• This concludes the modeling step

Modeling step …

Decoding Step

• Consider the transliteration of the word “George”.

• Alignments of George:

• Geo rge G eo rge

• Geo rge G eo rge

Decoding step

Decision to be made between….

• The context mapping is present in the map-dictionary

• Using ……

Decoding step …

• Where do the n-gram statistics come from?

Ans.: Automatic analysis of the bilingual dictionary

• How to align this dictionary?

Ans. : Using EM-algorithm

Transliteration Alignment

EM Algorithm

Bootstrap

Expectation

Maximization

TransliterationUnits

Bootstrap initial random alignment

Update n-gram statistics to estimate probability distribution

Apply the n-gram TM to obtain new alignment

Derive a list of transliteration units from final alignment

Evaluation

E2C Error rates for n-gram tests E2C v/s C2E for TM Tests

Conclusion

• Transliteration can make use of phonemes as an intermediate layer to move from a script to another

• Spelling-based approach connects the word sequences of the two languages

• The joint source channel method integrates optimization of alignment and transliteration

• no pre-alignment needed• reduction in development efforts

( the end )

References

• For all Devnagari transliterations, www.quillpad.in/hindi/

H. Li,M. Zhang, and J. Su. 2004. A joint source-channel model for machine transliteration. In ACL, pages 159–166.

www.wikipedia.org

Y. Al-Onaizan and K. Knight. 2002. Machine transliteration of names in Arabic text. In ACL Workshop on Comp. Approaches to Semitic Languages.

K. Knight and J. Graehl. 1998. Machine transliteration. Computational Linguistics, 24(4):599–612.

N. AbdulJaleel and L. S. Larkey. 2003. Statistical transliteration for English-Arabic cross language information retrieval. In CIKM, pages 139–146.

• Joint source-channel model

• Phoneme and spelling-based models