Decoding Chinese character sequence: neural network model and beyond Zhao Hai Shanghai Jiao Tong University 上海交通大学 [email protected] 福州, 2015.04.19


  • Decoding Chinese character sequence: neural network model and beyond

    Zhao Hai Shanghai Jiao Tong University

    [email protected]

    Fuzhou (福州), 2015.04.19

  • Outline

    Motivation

    The role Chinese characters play: Sino-Tibetan languages

    Sinosphere languages

    Chinese IME

    Loose Machine translation

    Neural Network Language model

    Experimental results

    Conclusion

  • Chinese connection

    Chinese characters have more external connections than Chinese language itself.

    The Chinese language is related only to the Sino-Tibetan languages, but Chinese characters connect it to many more languages.

  • Sino-Tibetan Language Family Tree


  • Sino-Tibetan Languages on the Map

  • Sinosphere, writing connects all

    Chinese characters (hanzi, kanji), a well developed logographic script and the oldest continuously used writing system in the world, are still used in China, Vietnam, the Korean peninsula, Japan, Singapore, and Malaysia.

    The Sinosphere unofficially refers to the regions that have been historically or culturally influenced by China.

    However, the languages of the Sinosphere have weak linguistic relations with Chinese.

  • Sino-Tibetan languages, writing in the same way: Lolo (Yi)

  • Why character writing system leads to the future

    Accommodates more change in reading: in Vietnamese, reading and writing diverged within only 50 years

    Burmese: writing is one way, reading is another, for 1,000 years

    Accommodates freer word order

  • Four Types of Word Order in Sino-Tibetan Languages

    Languages                        Word order
    Chinese, Bai                     Verb-Object, Modifier-Noun
    Karen, Shan, Thai                Verb-Object, Noun-Modifier
    Jingpho                          Object-Verb, Modifier-Noun
    Burmese, Tibetan, Lolo, Qiang    Object-Verb, Noun-Modifier

  • Free order in Chinese

    Is Chinese in SOV order? NO

    Does Chinese support only a modifier-noun order? NO

  • Which order in Chinese

    Chinese is a free-order language; only detailed semantics and function words matter.

    Every type of word order is possible.

    That's the future: 4,000 years of evolution with the largest population. Thus we need a character-based writing system.

  • Sino-Tibetan Languages: From Writing

    Language   First written      Script
    Chinese    20th century BC    hanzi
    Tibetan    7th century        abugida
    Tangut     11th century       self-made hanzi-like script
    Burmese    11th century       abugida

  • Alphabetization of languages in the Sinosphere (red in the original slides marks the official script)

    Romanized alphabets: Chinese pinyin (China), a romanization scheme for Japanese (Japan), a romanization scheme for Korean (Korea), Chữ Quốc Ngữ (Vietnam)

    National alphabets: kana (仮名, Japan), hangul (Korea)

    Chinese characters: hanzi (汉字, China), kanji (漢字, Japan), hanja (漢字, Korea), Hán tự / Chữ Nôm (漢字/字喃, Vietnam)

  • Application Driven: syllable-to-character conversion tasks

    Chinese pinyin IME (input method engine): from a pinyin sequence to a Chinese character sentence

    Loose machine translation: from kana, hangul, or Vietnamese to a Chinese character sentence; rewriting those Sino-Tibetan languages

    And more: to see multilingual pronunciation differences on the basis of semantic equivalence.

  • Pinyin based Chinese IME

    Most IMEs are based on pinyin.

    Ignoring tone, there are fewer than 500 pinyin syllables in Chinese.

    Meanwhile, 3,000 to 20,000 Chinese characters are in use, depending on the application.

    In any case, the main obstacle for a pinyin IME is letting users choose the wanted character as fast as possible for any input pinyin syllable.

  • General Strategy

    For each input pinyin syllable, there are usually dozens of candidate Chinese characters on the mapping list.

    If we input bi-syllable, tri-syllable, or even longer pinyin sequences at a time, far fewer character candidates remain, as the toy sketch below shows.

    Therefore, for quicker and more accurate Chinese input: input pinyin sequences as long as possible!
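    To make the narrowing concrete, here is a toy sketch in Python; the candidate lists are invented samples, not real IME data:

```python
# Toy illustration of candidate narrowing (all lists are invented samples,
# not real IME data): one syllable maps to many characters, while a
# two-syllable unit maps to only a few words.
mono = {"yi": ["一", "以", "已", "意", "易", "医", "艺", "衣"]}  # dozens in reality
bi = {"yi fu": ["衣服", "一副"]}                                 # far fewer

print(len(mono["yi"]), "candidates for the syllable 'yi'")
print(len(bi["yi fu"]), "candidates for the bi-syllable 'yi fu'")
```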

  • Pinyin IME as Chinese character sequence decoding task

    Input: a pinyin syllable sequence

    Output: a Chinese character sequence mapped one-to-one to the syllables

    Sequence labeling task; maximum entropy: previous work

    Statistical machine translation: ours

    Example input: zi ran yu yan chu li → 自然语言处理 (natural language processing); a toy decoding sketch follows
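    As a concrete illustration of this decoding view, the sketch below runs a beam search over character candidates; the lexicon and bigram scores are invented toy values standing in for a real language model, not the system described in these slides:

```python
# A minimal sketch of syllable-to-character decoding via beam search.
# LEXICON and BIGRAM hold invented toy values, not a real language model.
LEXICON = {                      # pinyin syllable -> candidate characters
    "zi": ["自", "子", "字"], "ran": ["然", "染"],
    "yu": ["语", "于", "鱼"], "yan": ["言", "眼"],
    "chu": ["处", "出"],      "li": ["理", "里", "力"],
}
BIGRAM = {                       # character-bigram log scores (toy)
    ("自", "然"): -0.1, ("然", "语"): -0.5, ("语", "言"): -0.1,
    ("言", "处"): -0.7, ("处", "理"): -0.2,
}
FLOOR = -5.0                     # back-off score for unseen bigrams

def decode(syllables, beam=10):
    """Return the best-scoring character sequence for the pinyin input."""
    beams = [(0.0, [])]          # (cumulative score, character sequence)
    for syl in syllables:
        expanded = [
            (score + (BIGRAM.get((seq[-1], ch), FLOOR) if seq else 0.0),
             seq + [ch])
            for score, seq in beams
            for ch in LEXICON[syl]
        ]
        beams = sorted(expanded, key=lambda x: -x[0])[:beam]  # prune to beam
    return "".join(beams[0][1])

print(decode("zi ran yu yan chu li".split()))  # -> 自然语言处理
```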

  • Chinese character sequence decoding as SMT (Yang and Zhao, PACLIC 2012)

    Pipeline: no alignment learning

    Only standard MERT tuning and Moses decoding are adopted

    Effectively integrates a language model and other linguistic features

    Both character accuracy and whole-sentence accuracy outperform the previous maximum entropy model:

    Character accuracy   10K     100K    1M
    ME                   0.829   0.891   0.933
    SMT                  0.947   0.952   0.955

    Sentence accuracy    10K     100K    1M
    ME                   0.075   0.169   0.302
    SMT                  0.402   0.429   0.454

  • A close lexical connection on Chinese characters

    Japanese: more than 50% of Japanese vocabulary comes from Chinese. In modern times, however, many words for Western science, technology, and culture were first written in Japanese kanji and then passed back into Chinese.

    Korean: Sino-Korean vocabulary covers about 60%.

    Vietnamese: Sino-Vietnamese vocabulary covers about 60%.

    Vietnamese    Chinese   Pinyin
    lịch sử       历史      lì shǐ
    định nghĩa    定义      dìng yì
    phong phú     丰富      fēng fù
    thời sự       时事      shí shì

  • Both Vietnamese and Korean adopted alphabetized writing in modern times

    Japanese is different: its writing mixes alphabets (kana) with Chinese characters, so Chinese readers can more or less guess what a Japanese text means.

    But Vietnamese and Korean texts give no such clue.

    For machine translation between Vietnamese/Korean and Chinese, it is very hard to collect a sufficient parallel corpus.

  • Korean can be written in this way

    Sino-Korean writing:

    Korean only

    Korean with Chinese in parentheses

    Korean and Chinese mixed, Korean as majority

    Korean and Chinese mixed, Chinese as majority

  • Korean can be written in this way: the first part of South Korea's constitution

    [The slide shows the preamble (adopted 1948.7.12) and first articles of the South Korean constitution in mixed hangul-hanja writing; the text did not survive extraction.]

  • Meaning-read Chinese character sequence

    Given the historic connection among the languages of the Sinosphere, we present a Chinese character transliteration form that follows strict lexical-semantic equivalence for the related machine translation.

    By analogy with the Japanese term kun-yomi (訓読み), such a sequence of Chinese characters, kept in the word order of the original language, is called a meaning-read Chinese character sequence (MRCCS).

  • Language difference: Korean vs. Chinese

    Sound: Korean is spoken without tones (like Japanese), but Chinese has them.

    Korean follows vowel harmony rules.

    Grammar: Korean is SOV in its syntax (just like Japanese), while Chinese is SVO.

    Korean is agglutinative in its morphology, with rich suffixes used for meaning representation. Chinese is an isolating language; word order is its main grammatical means.

    Korean has five groups and nine parts of speech. Only words like nouns and pronouns can be translated directly.

  • Language difference: Vietnamese vs. Chinese

    Sound: both have tones; Vietnamese has six and Chinese has five.

    Grammar: both are isolating (analytic) languages, and neither uses morphological marking of case, gender, number, or tense. Word order plays the most important grammatical role in both languages. Both conform to SVO word order.

    Like most Southeast Asian languages (Thai, Cambodian), Vietnamese is head-initial, which is quite different from Chinese. So the word for "Vietnamese language" is not "Việt Nam tiếng" in Vietnamese, but "tiếng Việt Nam". The phrase "the official language of the Kinh people" should be "ngôn ngữ (language) chính thức (official) của (of) dân tộc (people) Kinh".

  • Problems to be solved

    Grammar translation: the MRCCS is ungrammatical Chinese.

    Solution: rephrasing-based revising.

    Simple solution: use only a language model to perform reordering, as sketched below.

    From n-gram language models to neural network language models
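    A minimal sketch of the LM-only reordering idea: score every permutation of a short MRCCS span and keep the best. The bigram scores below are invented placeholders; a real system would use the n-gram or neural LM discussed next.

```python
from itertools import permutations

def reorder(tokens, bigram, floor=-6.0):
    """Exhaustive LM-only reordering; feasible only for short spans (n! growth)."""
    def score(seq):
        return sum(bigram.get((a, b), floor) for a, b in zip(seq, seq[1:]))
    return list(max(permutations(tokens), key=score))

# Toy bigram scores; the MRCCS input keeps the Vietnamese head-initial order.
bigram = {("西班牙", "游客"): -0.2, ("游客", "品"): -0.8, ("品", "茶"): -0.3}
print(reorder(["游客", "西班牙", "品", "茶"], bigram))
# -> ['西班牙', '游客', '品', '茶'] (the modifier is moved before the noun)
```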

  • NNLM Background

    Neural network language models (NNLMs), or continuous-space language models (CSLMs), have been shown to improve perplexity (PPL) and statistical machine translation (SMT) performance. However, CSLMs have not been used during decoding, because querying a CSLM in decoding takes a lot of time. We propose a method for converting CSLMs into back-off n-gram language models (BNLMs) so that the converted CSLMs can be used in decoding.

  • CSLM
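    The slide shows the CSLM architecture. Below is a minimal numpy sketch of the usual feed-forward CSLM forward pass; all sizes and the random weights are placeholder assumptions, not the trained model from the experiments:

```python
import numpy as np

# A minimal sketch of a feed-forward CSLM forward pass: context words are
# projected into a continuous space, passed through a tanh hidden layer,
# and a softmax gives P(word | context). Sizes and weights are placeholders.
rng = np.random.default_rng(0)
V, D, H, CONTEXT = 1000, 32, 64, 4          # vocab, embedding, hidden, 5-gram

E = rng.normal(0.0, 0.1, (V, D))            # shared word embedding table
W_h = rng.normal(0.0, 0.1, (CONTEXT * D, H))
W_o = rng.normal(0.0, 0.1, (H, V))

def cslm_prob(context_ids, word_id):
    """P(word_id | 4 context word ids) under the toy model."""
    x = E[context_ids].reshape(-1)          # concatenated context embeddings
    h = np.tanh(x @ W_h)                    # hidden layer
    logits = h @ W_o                        # one score per vocabulary word
    p = np.exp(logits - logits.max())       # numerically stable softmax
    return (p / p.sum())[word_id]

print(cslm_prob([3, 17, 42, 7], word_id=99))
```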

  • Why not CSLM in decoding?

    2,000 NTCIR-9 English sentences as test data.

    A 5-gram CSLM (4 context words) and a BNLM trained on the same 1 million NTCIR-9 English sentences.

    Evaluate the probability of every n-gram.

    LM      CPU time 1   CPU time 2   CPU time 3
    BNLM    3.241 s      4.044 s      4.404 s
    CSLM    42.058 s     42.372 s     38.361 s

  • CSLM in SMT

    [Pipeline diagrams: Training → Tuning (MERT) → first-pass Decoding → N-best list → Re-rank → Result. The two variants differ in the first-pass language model, the converted model (CONV) versus a standard BNLM, with the CSLM used for n-best re-ranking.]

  • Conversion Method

    [Diagram: starting from text data, the 2-gram CSLM is converted and entropy-pruned into a 2-gram CONV, which is appended to the 3-gram BNLM with renormalized back-off weights; the 3-gram CSLM is likewise converted, pruned, and appended to the 4-gram BNLM with renormalized back-off weights, and so on for the 4-gram CONV. The final appended model is used as the BLM.]

    Arsoy et al., ICASSP 2013

    Wang et al. (ours), EMNLP 2013
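    The renormalization step can be sketched as follows. This is only an illustration of the idea, with toy probability functions standing in for the CSLM and the lower-order model; the exact recipe is in Wang et al. (EMNLP 2013).

```python
# A rough sketch of the conversion idea: n-grams observed in the text take
# their probabilities from the CSLM, and each history's back-off weight is
# renormalized so that probabilities still sum to one.

def renormalized_bow(history, seen, prob, lower_prob):
    """bow(h) = (1 - sum of explicit probs) / (1 - sum of lower-order probs)."""
    num = 1.0 - sum(prob(history, w) for w in seen)
    den = 1.0 - sum(lower_prob(history[1:], w) for w in seen)
    return num / den

def convert(observed, prob, lower_prob):
    """observed maps each history tuple to the set of words seen after it."""
    table = {}
    for h, words in observed.items():
        for w in words:
            table[h + (w,)] = prob(h, w)    # explicit probability from the CSLM
        table[h] = renormalized_bow(h, words, prob, lower_prob)
    return table

# Toy stand-ins for the CSLM and the already-converted lower-order model.
prob = lambda h, w: 0.2 if w == "理" else 0.05
lower_prob = lambda h, w: 0.1
print(convert({("处",): {"理", "于"}}, prob, lower_prob))
```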

  • Experiments and Results

    Corpus:

    (1) NTCIR-9: 1 million Chinese-to-English sentence pairs

    (2) TED: 186K Chinese-to-English sentence pairs (additional monolingual corpora are hard to obtain)

  • Pinyin IME decoding with NNLM

    Test corpus   LM        One-best   N-best   Character acc.
    10K           trigram   0.7472     0.8992   0.968
    10K           NNLM      0.7571     0.9014   0.968
    400K          trigram   0.6702     0.8608   0.9546
    400K          NNLM      0.6768     0.8645   0.9559

  • A full example on Vietnamese to Chinese translation

    Du khách Tây Ban Nha thưởng thức trà tại Trâm Anh quán.

    Gloss: tourist | Spain | appreciate | tea | at | Trâm Anh | shop

    Parallel pair: the Vietnamese sentence above with its Chinese translation (the Chinese text is not recoverable from this extraction)

    (1) MRCCS; (2) reorder with language model scores

    (the result is a nearly precise translation, except that the prepositional phrase should precede the verb in Chinese)

  • A full example on Vietnamese to Chinese translation

    Google translation: "Spanish tourists enjoy tea at the British outpost." (Both the Chinese and the English translations are far from the correct meaning of the original Vietnamese text.)

    Why? Google translation converts the word "British" into "người Anh" in Vietnamese. Note that the named-entity word above, "Trâm Anh", contains the same core syllable, "Anh".

    As Google translation cannot find a good mapping for "Trâm Anh", it falls back to the English translation of "người Anh"; hence the incorrect translation "British". From this we can infer that Google translation uses English as a pivot language for Vietnamese-to-Chinese translation.

    This shows that the historic connection between related languages can help improve machine translation.

  • Vietnamese to MRCCS


  • Conclusions

    Chinese characters as pivot: more accurate translation than before

    The same decoder solves different problems.

    The neural network language model works.

    Chinese characters as the writing system that leads to the future

  • PACLIC 29, 2015 @ Shanghai

  • Thank you

    xie xie (谢谢)