
EMNLP reading group @ 2017-02-15


Page 1:

EMNLP 2016 reading: Incorporating Discrete Translation Lexicons into Neural Machine Translation

Authors: Philip Arthur, Graham Neubig, Satoshi Nakamura

Presenter: Sekizawa Yuuki (Komachi lab, M1)

17/02/15

Page 2:

Incorporating Discrete Translation Lexicons into Neural Machine Translation

• NMT often makes mistakes when translating low-frequency content words
  • such mistakes lose the meaning of the sentence
• Proposed method
  • encode low-frequency words with lexicon probabilities
  • two ways to combine them: (1) use them as a bias, (2) linear interpolation
• Results (English-to-Japanese translation on two corpora: KFTT, BTEC)
  • improvements of 2.0-2.3 BLEU and 0.13-0.44 NIST
  • faster convergence time

Page 3:

Features of NMT

• NMT systems
  • treat each word in the vocabulary as a vector of continuous-valued numbers
  • share statistical power between similar words ("dog" and "cat") or contexts ("this is" and "that is")
  • drawback: they often mistranslate into words that seem natural in the context but do not reflect the content of the source sentence
• PBMT / SMT systems rarely make this kind of mistake
  • they base their translations on discrete phrase mappings
  • these ensure that source words are translated into a target word that has been observed as a translation at least once in the training data

Page 4:

NMT

• Source words: F = f_1 ... f_|F|
• Target words: E = e_1 ... e_|E|
• Translation probability, computed one target word at a time:
  p(e_t | F, e_1^{t-1}) = softmax(W_s η_t + b_s)
  where W_s is a weight matrix, b_s a bias vector, and η_t a fixed-width hidden vector summarizing the context
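To make the formula above concrete, here is a minimal NumPy sketch of the output distribution, with made-up toy sizes for the hidden vector and target vocabulary (not the paper's actual model code):

```python
import numpy as np

def softmax(x):
    """Numerically stable softmax over the last axis."""
    z = x - x.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Toy sizes (hypothetical): hidden width 4, target vocabulary of 6 words.
hidden_size, vocab_size = 4, 6
rng = np.random.default_rng(0)

eta_t = rng.normal(size=hidden_size)               # fixed-width decoder hidden vector at step t
W_s   = rng.normal(size=(vocab_size, hidden_size)) # weight matrix
b_s   = np.zeros(vocab_size)                       # bias vector

# p(e_t | F, e_1^{t-1}) = softmax(W_s @ eta_t + b_s)
p_nmt = softmax(W_s @ eta_t + b_s)
print(p_nmt, p_nmt.sum())                          # a proper distribution over the target vocabulary
```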

Page 5:

Integrating Lexicons into NMT

• Lexicon probability: combine the alignment (attention) probabilities with a lexical matrix built for the input sentence
  p_lex(e_t = e | F, a_t) = Σ_j a_{t,j} · p_lex(e | f_j), i.e. the lexical matrix L_F times the alignment probability vector a_t
  • L_F : lexical matrix for the input sentence (rows: target vocabulary, columns: input sentence words)
  • a_t : alignment probabilities from the attention model
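A small sketch of this computation, with an invented toy vocabulary and a three-word input sentence: the sentence-specific lexical matrix times the attention vector gives a lexicon-based distribution over the target vocabulary.

```python
import numpy as np

vocab_size, src_len = 6, 3          # |V_e| and |F|, toy values
rng = np.random.default_rng(1)

# L_F: lexical matrix for the input sentence; column j is p_lex(e | f_j),
# a distribution over the target vocabulary for source word f_j.
L_F = rng.random(size=(vocab_size, src_len))
L_F /= L_F.sum(axis=0, keepdims=True)

# a_t: alignment (attention) probabilities over the input words at step t.
a_t = np.array([0.7, 0.2, 0.1])

# p_lex(e_t = e | F, a_t) = sum_j a_{t,j} * p_lex(e | f_j)  =  L_F @ a_t
p_lex_t = L_F @ a_t
print(p_lex_t, p_lex_t.sum())       # sums to 1 because each column of L_F does
```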

Page 6:

Combining the lexicon probability

1. Model bias: add the log lexicon probability as a bias inside the softmax
   p_bias(e_t | F, e_1^{t-1}) = softmax(W_s η_t + b_s + log(p_lex(e_t) + ε))
   ε prevents zero probabilities (here ε = 0.001)

2. Linear interpolation: mix the lexicon and NMT distributions
   p_interp(e_t) = λ · p_lex(e_t) + (1 - λ) · p_NMT(e_t)
   λ : learnable interpolation coefficient, initialized to 0.5
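The two combinations can be sketched as follows; the pre-softmax scores and the lexicon distribution are random toy values, ε = 0.001 as on the slide, and the interpolation coefficient is fixed at its initial value 0.5 instead of being learned:

```python
import numpy as np

def softmax(x):
    z = x - x.max()
    e = np.exp(z)
    return e / e.sum()

vocab_size = 6
rng = np.random.default_rng(2)

scores  = rng.normal(size=vocab_size)   # toy stand-in for W_s @ eta_t + b_s
p_lex_t = rng.random(vocab_size)        # toy lexicon distribution (previous slide)
p_lex_t /= p_lex_t.sum()

# 1. Model bias: add log(p_lex + eps) inside the softmax;
#    eps keeps zero lexicon probabilities from becoming -inf (slide uses 0.001).
eps = 1e-3
p_bias = softmax(scores + np.log(p_lex_t + eps))

# 2. Linear interpolation: mix the two distributions with a coefficient
#    lambda, initialized to 0.5 (kept fixed here for the sketch).
lam = 0.5
p_nmt = softmax(scores)
p_interp = lam * p_lex_t + (1.0 - lam) * p_nmt

print(p_bias.sum(), p_interp.sum())     # both remain valid distributions
```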

Page 7:

Constructing the Lexicon Probability

1. Automatic learning
   • use the EM algorithm on the training data
   • E-step: compute the expected counts c(e | f)
   • M-step: re-estimate the lexicon probability p_lex(e | f) = c(e | f) / Σ_e' c(e' | f), dividing by the sum of all counts for f
2. Manual
   • use dictionary entries as translations: p_lex(e | f) is uniform over the translation set of source word f
3. Hybrid: combine the automatic and manual lexicons
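A minimal sketch of options 1 and 2 above: an IBM Model 1-style EM loop over an invented toy parallel corpus (the paper's exact estimation setup may differ), followed by a uniform manual lexicon built from invented dictionary entries.

```python
from collections import defaultdict

# Toy parallel corpus (source words, target words), invented for illustration.
corpus = [
    (["the", "dog"], ["inu"]),
    (["the", "cat"], ["neko"]),
    (["a", "dog"],   ["inu"]),
]

tgt_vocab = {e for _, es in corpus for e in es}
# Uniform initialization of p_lex(e | f).
p_lex = {f: {e: 1.0 / len(tgt_vocab) for e in tgt_vocab}
         for fs, _ in corpus for f in fs}

for _ in range(5):                                    # a few EM iterations
    counts = defaultdict(lambda: defaultdict(float))  # expected counts c(e | f)
    for fs, es in corpus:
        for e in es:
            # E-step: distribute one count for e over the source words,
            # proportionally to the current p_lex(e | f).
            norm = sum(p_lex[f][e] for f in fs)
            for f in fs:
                counts[f][e] += p_lex[f][e] / norm
    # M-step: p_lex(e | f) = c(e | f) / sum_e' c(e' | f)
    p_lex = {f: {e: c / sum(ec.values()) for e, c in ec.items()}
             for f, ec in counts.items()}

print(round(p_lex["dog"]["inu"], 3))                  # converges to 1.0 on this toy data

# Manual lexicon: uniform over the dictionary's translation set D_f.
dictionary = {"dog": ["inu", "ken"], "cat": ["neko"]}
p_lex_manual = {f: {e: 1.0 / len(es) for e in es} for f, es in dictionary.items()}
```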

Page 8:

Experiment

• Datasets: KFTT, BTEC
  • English to Japanese
  • tokenized and lowercased, sentence length <= 50
  • low-frequency words are replaced with <unk> and translated at test time following Luong et al. (2015); frequency threshold: 1 for BTEC, 3 for KFTT (see the sketch after the data tables)
• Evaluation
  • BLEU, NIST, and recall of rare words from the references (words appearing fewer than 8 times in the target training corpus or references)

Data statistics:

  Split  Corpus   Sentences   Tokens (En)   Tokens (Ja)
  Train  BTEC     464K        3.60M         4.97M
  Train  KFTT     377K        7.77M         8.04M
  Dev    BTEC     510         3.8K          5.3K
  Dev    KFTT     1,160       24.3K         26.8K
  Test   BTEC     508         3.8K          5.5K
  Test   KFTT     1,169       26.0K         28.4K

Vocabulary sizes:

  Corpus   Source   Target
  BTEC     17.8k    21.8k
  KFTT     48.2k    49.1k
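As a sketch of the <unk> preprocessing step mentioned on this slide, with an invented toy corpus (the slide's thresholds are 1 for BTEC and 3 for KFTT):

```python
from collections import Counter

def replace_rare(sentences, threshold):
    """Replace words whose training-corpus frequency is at or below the threshold."""
    freq = Counter(w for s in sentences for w in s)
    return [[w if freq[w] > threshold else "<unk>" for w in s] for s in sentences]

# Toy corpus for illustration only.
train = [["this", "is", "a", "pen"], ["this", "is", "a", "dog"], ["rare", "word"]]
print(replace_rare(train, threshold=1))
```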

Page 9:

Experiment

• Methods
  • pbmt: Koehn+ (2003), using Moses
  • hiero (hierarchical phrase-based MT): Chiang+ (2007), using Travatar
  • attn: Bahdanau+ (2015), attentional NMT
  • auto-bias: proposed method with an automatically learned lexicon
  • hyb-bias: proposed method with a hybrid dictionary
• Lexicons
  • auto: learned from the training data (separately) with GIZA++
  • manual: English-Japanese dictionary (Eijiro, 104k entries)
  • hyb: combination of the "auto" and "manual" lexicons

Page 10:

Comparison with related work

• Results table (significance: † : p < 0.05, * : p < 0.10)
• Largest gains over the attn baseline: +2.3 BLEU, +0.44 NIST, +30% recall

Page 11:

Comparison with related work

• Results table (significance: † : p < 0.05, * : p < 0.10)
• On KFTT, BLEU goes up but NIST goes down compared with SMT
  • traditional SMT systems keep a small advantage in translating low-frequency words

Page 12:

Translation examples

Page 13:

Training curves

• Training curves on KFTT (blue: attn, orange: auto-bias, green: hyb-bias)
• Already at the first iteration, the proposed methods reach higher BLEU than attn
• Time per iteration: 167 minutes (attn) vs. 275 minutes (auto-bias)
  • the overhead comes from computing and using the lexical probability matrix

Page 14:

Attention matrices

• The proposed (bias) model produces more correct attention
• Lighter colors indicate stronger attention to a word; red boxes mark the correct alignments

Page 15:

Results of the proposed methods (first column: NMT without a lexicon)

• bias: the manual lexicon (man) is less effective, due to its limited coverage of target-domain words
• linear: the trend is the reverse of bias, and overall worse than bias, due to the constant interpolation coefficient

Page 16:

Incorporating Discrete Translation Lexicons into Neural Machine Translation

• NMT often makes mistakes when translating low-frequency content words
• Proposed method
  • encode low-frequency words with lexicon probabilities
  • two ways to combine them: (1) use them as a bias, (2) linear interpolation
• Results: improvements of 2.0-2.3 BLEU and 0.13-0.44 NIST, and faster convergence