Modeless Japanese Input Method

Hybrid method for modeless Japanese input using N-gram based binary classification and dictionary

Yukino Ikegami, Setsuo Tsuruta

2014/01/20

Necessity of Japanese Input Method

• Japanese has many characters
  – Kana
    • Hiragana: 81 characters, e.g.) いろはにほへと
    • Katakana: 81 characters, e.g.) イロハニホヘト
  – Kanji (Chinese characters)
    • More than 6,000 characters, e.g.) 以呂波仁保反止

• We can't input them directly with a keyboard
  – A Japanese input method (converting alphabet to Japanese characters) is necessary

If every Japanese character were assigned its own key…

• Toooo many keys!
• A Japanese input method is necessary

Japanese Input Method: Roman to Kana-Kanji Converter

• Flow
  1. Receive the Romanized alphabet input
  2. Convert the Romanized alphabet into Kana using a Roman-to-Kana table
  3. Convert Kana into Kanji (if necessary)

• Example
  ① n e k o d e s u
  ② ねこです
  ③ 猫です
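The three-step flow can be sketched in a few lines of Python. This is a minimal illustration, not the converter used in the talk: the tables `ROMAN_TO_KANA` and `KANA_TO_KANJI` are tiny hypothetical subsets, the longest-match strategy is an assumption, and a real Kana-to-Kanji step needs a dictionary and a language model rather than a plain lookup.

```python
# Tiny illustrative subset of a Roman-to-Kana table (assumption: a real
# table covers every syllable, doubled consonants, and so on).
ROMAN_TO_KANA = {
    "ne": "ね", "ko": "こ", "de": "で", "su": "す",
    "ha": "は", "re": "れ", "a": "あ", "n": "ん",
}
MAX_KEY_LEN = max(len(key) for key in ROMAN_TO_KANA)

# Toy Kana-to-Kanji lookup (hypothetical; real converters rank candidates).
KANA_TO_KANJI = {"ねこ": "猫"}


def roman_to_kana(text: str) -> str:
    """Step 2: convert Romanized input to Kana by greedy longest match."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest table key first, then shorter ones.
        for size in range(min(MAX_KEY_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in ROMAN_TO_KANA:
                out.append(ROMAN_TO_KANA[chunk])
                i += size
                break
        else:
            # No table entry: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


def kana_to_kanji(kana: str) -> str:
    """Step 3: replace known Kana words with Kanji (toy lookup only)."""
    for reading, kanji in KANA_TO_KANJI.items():
        kana = kana.replace(reading, kanji)
    return kana


kana = roman_to_kana("nekodesu")   # step 1 input -> step 2 output
print(kana, kana_to_kanji(kana))
```

With the toy tables above, the input from ① yields the Kana string of ② and the Kanji string of ③.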

Problems on Japanese Input Method

• Need to switch input modes between Japanese and ASCII
  – e.g. to input ‘あれは 8Byte です’ (That is 8Byte):
    areha [Return] [ASCII Mode] 8Byte [Japanese Mode] desu
    ([ASCII Mode] and [Japanese Mode] are each a mode switch)

• Switching is cumbersome!

Adding Terms to the Dictionary for the Mode-Switching Problem

• Add terms from other languages to the dictionary of a conventional input method editor

• Shortcomings
  – New terms are created continuously
  – Homograph problem

Related Work

• Modeless Pinyin-Chinese Input [Chen et al. 2000]
  – Converts alphabet (Pinyin) to Chinese
  – Uses only word-surface features for classification

• Type-Any [Ehara et al. 2009]
  – Converts alphabet to any language
  – Requires pressing a delimiter key when converting
  – Uses only word-surface features for classification

Approach: Modeless Japanese Input Method

• Automatically switch the input mode

1. Generate a discriminating model with a Support Vector Machine (SVM)
   – The model is described by multiple n-gram features

2. Use the discriminating model to decide whether each segment of the alphabet sequence is Kana or not
   – e.g. nekohacatdesu → nekoha / cat / desu → ねこは cat です
     (Japanese / English / Japanese)
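To make step 2 concrete, the sketch below turns per-character decisions into segments. It assumes the discriminating model emits one label per input character, "K" for Kana and "A" for ASCII. That label scheme and the function name `split_segments` are illustrative choices, and the labels in the example are written by hand rather than produced by an SVM.

```python
from itertools import groupby


def split_segments(text: str, labels: str) -> list[tuple[str, str]]:
    """Group consecutive characters that share a label into segments.

    `labels` holds one label per character of `text`
    (assumption: "K" = convert to Kana, "A" = keep as ASCII).
    """
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    segments = []
    pos = 0
    for label, run in groupby(labels):
        length = len(list(run))
        segments.append((text[pos:pos + length], label))
        pos += length
    return segments


# Hand-written labels standing in for the classifier output.
print(split_segments("nekohacatdesu", "KKKKKKAAAKKKK"))
```

Each "K" segment would then go through Roman-to-Kana conversion, while each "A" segment is passed through unchanged.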

Main flow of Modeless Japanese Input Method

[Flowchart: User input (alphabet sequence) → for each character in the user input → decision "Is the character still ASCII?" (True / False), with the Discriminative Model and the Non-Japanese Dic. as inputs → on False, Kana conversion → System response (Kana & alphabet sequence)]

Flow of Generating Discriminative Model

1. Load Texts
   • 猫は cat です

2. Kanji to Kana
   • Using Japanese Morphological Analyzer (MeCab)
   • ネコハ cat デス

3. Kana to ASCII
   • Using Kana to ASCII table (used by Google Japanese input)
   • nekohacatdesu

4. ASCII to n-gram
   • character-surface: ne, ek, nek, ko, eko, oh, koh, ha, oha...
   • character-type: LL, LL, LLL, LL, LLL, LL, LLL...
   • History: KK, KK, KKK, KK, KKK, KKK...

5. n-gram to ID
   • 1, 3, 4, 13, 22...

6. Describe as binary model
   • 1:1, 3:1, 4:1, 13:1, 32:1...

7. Learning on SVM
   • 1.344, 0.691, 0.023, -1.398...
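The "n-gram to ID" and "binary model" steps can be illustrated as follows. The `1:1, 3:1, ...` notation resembles the sparse `index:value` format used by tools such as LIBSVM. Whether the authors used that exact format is an assumption, and the function names and the `+1`/`-1` labels are hypothetical.

```python
def assign_ids(features: list[str], vocabulary: dict[str, int]) -> list[int]:
    """Map feature strings to integer IDs, registering unseen features.

    IDs start at 1 and the returned list is sorted and free of duplicates.
    """
    ids = set()
    for feature in features:
        if feature not in vocabulary:
            vocabulary[feature] = len(vocabulary) + 1
        ids.add(vocabulary[feature])
    return sorted(ids)


def to_sparse_line(label: int, ids: list[int]) -> str:
    """Render one training example as '<label> id:1 id:1 ...'.

    Every present feature gets the value 1 (binary features);
    absent features are simply omitted.
    """
    return f"{label:+d} " + " ".join(f"{i}:1" for i in ids)


vocabulary: dict[str, int] = {}
ids = assign_ids(["ne", "ek", "nek", "ne"], vocabulary)
print(to_sparse_line(+1, ids))
```

Lines in this form can be fed to an SVM trainer, which learns one weight per feature ID.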

n-gram Features

あれは 8Byte → a r e h a 8 B y t e
(in the case of n-gram upper limit n = 2, window size m = 2, focus point x_i = the 2nd “a”)

• Character-Surface
  – Substrings backward and forward from the focus point
  – e.g.) -2/ha -1/a8 0/8B 1/By

• Character-Type
  – Upper-case (U), Lower-case (L), Number (N), and Symbol (S)
  – e.g.) -2/LL -1/LN 0/NU 1/UL
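The two feature families can be reproduced with a short sketch. The offset convention (offset k covers the n characters starting at position focus + k + 1) is inferred from the example above and is not stated in the slides. Only n-grams of one fixed length are generated here, although "upper limit" suggests shorter n-grams may also be used. The function names are hypothetical.

```python
def char_type(ch: str) -> str:
    """Classify a character as Upper (U), Lower (L), Number (N) or Symbol (S)."""
    if ch.isupper():
        return "U"
    if ch.islower():
        return "L"
    if ch.isdigit():
        return "N"
    return "S"


def window_features(text: str, focus: int, n: int = 2, m: int = 2):
    """Return (surface, type) n-gram features around the focus index.

    Offsets run from -m to m-1; windows that fall outside the text are skipped.
    """
    surface, types = [], []
    for offset in range(-m, m):
        start = focus + offset + 1
        if start < 0 or start + n > len(text):
            continue
        gram = text[start:start + n]
        surface.append(f"{offset}/{gram}")
        types.append(f"{offset}/" + "".join(char_type(c) for c in gram))
    return surface, types


# Focus on the 2nd "a" of "areha8Byte" (index 4).
surface, types = window_features("areha8Byte", focus=4)
print(surface)
print(types)
```

With n = 2 and m = 2 this yields the same four surface features and four type features as the slide's example.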

Generating Non-Japanese Dictionary

• Words that never appear in Japanese-only text
  – Longer than 5 characters
  – Contain a substring that cannot be converted to Kana

• Sources
  – Corpus of Contemporary American English (COCA)
  – Japanese Wikipedia article title list
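The selection criteria can be sketched as a filter over candidate words. The romaji pattern below is a deliberately simplified assumption (it ignores doubled consonants and many table entries), so it only approximates "cannot be converted to Kana". The names `fully_convertible` and `is_dictionary_candidate` are hypothetical, and `japanese_only_words` stands in for the vocabulary observed in Japanese-only text.

```python
import re

# Simplified romaji syllable: optional consonant part plus a vowel, or a lone "n".
_SYLLABLE = r"(?:(?:[kgsztdnhbpmr]y?|[yw]|sh|ch|ts|j|f)?[aiueo]|n)"
_KANA_ONLY = re.compile(rf"(?:{_SYLLABLE})+")


def fully_convertible(word: str) -> bool:
    """True if the whole word parses as a sequence of romaji syllables."""
    return _KANA_ONLY.fullmatch(word.lower()) is not None


def is_dictionary_candidate(word: str, japanese_only_words: set[str]) -> bool:
    """Apply the three criteria: unseen in Japanese-only text, longer than
    5 characters, and containing a part that cannot become Kana."""
    return (
        word.lower() not in japanese_only_words
        and len(word) > 5
        and not fully_convertible(word)
    )


for word in ["keyboard", "banana", "Byte"]:
    print(word, is_dictionary_candidate(word, set()))
```

Under this approximation "banana" is rejected because it reads as valid romaji, and "Byte" is rejected because it is too short.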

Comparison with Conventional IME

Conventional method
  areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu
  ([Alphabet Mode] and [Japanese Mode] are each a mode switch)
  Typing: 17

Modeless Japanese input method
  areha8Bytedesu
  Typing: 14

• The number of keystrokes is decreased
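The two totals can be checked with a short count. The sketch assumes that each bracketed operation such as [Return] is one key press and that modifier keys such as Shift are not counted. Both assumptions are inferred from the totals, not stated in the slides.

```python
import re


def count_keystrokes(sequence: str) -> int:
    """Count key presses: one per bracketed operation, one per other character.

    Whitespace only separates tokens and is not counted.
    """
    tokens = re.findall(r"\[[^\]]*\]|\S", sequence)
    return len(tokens)


conventional = "areha [Return] [Alphabet Mode] 8Byte [Japanese Mode] desu"
modeless = "areha8Bytedesu"
print(count_keystrokes(conventional), count_keystrokes(modeless))
```

The conventional sequence breaks down as 5 + 1 + 1 + 5 + 1 + 4 = 17 key presses, against 14 for the modeless input.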

Datasets used in Evaluation Experiment

• Generating Model & Evaluating Method
  – Balanced Corpus of Contemporary Written Japanese (BCCWJ)
    • book, magazine, blog, government document and others

• Non-Japanese Dictionary Source
  – COCA
  – Japanese Wikipedia article title list

Criteria

• Precision, Recall, and F1-measure, reported separately for Kana and ASCII

Results of Evaluation

• The proposed method outperforms the baseline

                    Baseline                  Proposed method
                    (Char. surface n-gram)    (Char. {surface, type} n-gram & Dictionary)
Kana Precision      .998                      .999
Kana Recall         .993                      .998
Kana F1-measure     .953                      .968
ASCII Precision     .989                      .996
ASCII Recall        .780                      .884
ASCII F1-measure    .858                      .924

User test

• The proposed method outperforms the conventional method

Person No.          1      2      3      4      5      6      7      8      9      …
Conventional IME    18.18  17.89  15.4   12.71  11.09  10.18  11.42  12.38  10.48
Proposed method     13.34  14.68  9.88   12.23  6.03   7.00   11.03  11.37  10.30

• 4 females and 7 males
• Input example sentences (chat, mail, technological text)

Summary

• Switching input mode is cumbersome
• Hybrid Modeless Japanese Input Method
  – Automatically switches input mode between Japanese and ASCII
  – Uses an n-gram feature model for discrimination
    • character-{surface, type}
  – Outperforms conventional methods
