Transcript
  • Neural Machine Translation 2 (CMPT 413/825, Fall 2019)

    Jetic Gū

    School of Computing Science

    12 Nov. 2019

  • Overview

    • Focus: Neural Machine Translation

    • Architecture: Encoder-Decoder Neural Network

    • Main Story:

    • Extension to Seq2Seq: Copy Mechanism

    • Extension to Seq2Seq: Ensemble

    • Extension to Seq2Seq: BeamSearch

    • [Extra] Beyond Seq2Seq: Attention is all you need

    • [Extra] Beyond NMT

  • Sequence-to-Sequence (Review)

    [Figure: encoder-decoder RNN diagram]

    Pr(E | F) = ∏_t Pr(e′_t = e_t | F, e_{1:t−1})
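    A minimal sketch of this factorization (the `decoder_step` interface below is hypothetical, not from the slides): the log-probability of a candidate translation is the sum of per-step log-probabilities.

    import math

    def sentence_log_prob(decoder_step, src, tgt):
        # decoder_step(src, prefix) is assumed to return a dict
        # {word: Pr(e'_t = word | F, e_{1:t-1})} for the next position
        logp = 0.0
        for t, word in enumerate(tgt):
            probs = decoder_step(src, tgt[:t])
            logp += math.log(probs[word])
        return logp  # log Pr(E | F)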

  • Attention (Review)

    [Figure: encoder-decoder with attention; the encoder reads the source "was hat prof. sarkar gegessen" ("what did prof. sarkar eat"), and the decoder generates "prof likes cheese", attending over the encoder states at each step]

  • Attention (Review)

    [Figure: attention alignment heatmap; source (German): "Frau in Gelb mit einem Kinderwagen und ein Mann mit blauem Hemd" ("woman in yellow with a stroller and a man in a blue shirt"); prediction (English): "woman in shirt with a stroller and a man in blue shirt"; gold alignments marked]

  • Attention

    • Why does Attention work?

    • What is in the context vector?

    • What is in α?

    • It's alignment¹!

    • α learns to refer to useful information in src²

    • similar to human attention: we pay attention to whatever is needed

    context_t = ∑_i α_{t,i} · h^enc_i

    score_{t,i} = f(h^enc_i, h^dec_t)

    α_t = softmax(score_{t,:})

    1. CL2015008 [Bahdanau et al.] Neural Machine Translation by Jointly Learning to Align and Translate
    2. CL2017342 [Ghader and Monz] What does Attention in Neural Machine Translation Pay Attention to?
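    A minimal sketch of the three equations above, assuming PyTorch tensors and a plain dot product as the score function f (names are illustrative):

    import torch
    import torch.nn.functional as F

    def attention_context(h_enc, h_dec_t):
        # h_enc: (src_len, hidden), h_dec_t: (hidden,)
        scores = h_enc @ h_dec_t                                # score_{t,i} = f(h^enc_i, h^dec_t)
        alpha_t = F.softmax(scores, dim=0)                      # alpha_t = softmax(score_{t,:})
        context_t = (alpha_t.unsqueeze(1) * h_enc).sum(dim=0)   # sum_i alpha_{t,i} h^enc_i
        return context_t, alpha_t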

  • Common Problems of NMT

    • Out-of-Vocabulary (OOV) Problem; Rare word problem

    • Frequent words are translated correctly; rare words are not

    • In any corpus, word frequencies are exponentially unbalanced

  • OOV

    [Figure: Zipf's Law¹; occurrence count (1 to 1,000,000, log scale) against frequency rank (1 to 50,000)]

    • rare words are exponentially less frequent than frequent words

    • e.g. in HW4, 45% | 65% (src | tgt) of the unique words occur only once

    1. https://en.wikipedia.org/wiki/Zipf%27s_law
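    A quick way to check this kind of statistic on your own data (a sketch; the file names in the comment are hypothetical):

    from collections import Counter

    def singleton_fraction(path):
        # fraction of unique (whitespace-tokenised) words that occur exactly once
        counts = Counter()
        with open(path, encoding="utf-8") as f:
            for line in f:
                counts.update(line.split())
        return sum(1 for c in counts.values() if c == 1) / len(counts)

    # e.g. singleton_fraction("train.de"), singleton_fraction("train.en")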

  • Common Problems of NMT

    • Out-of-Vocabulary (OOV) Problem; Rare word problem

    • Frequent words are translated correctly; rare words are not

    • In any corpus, word frequencies are exponentially unbalanced

    • Under-translation

    • Crucial information is left untranslated; premature generation

  • Common Problems of NMT (Under-translation)

    • Out-of-Vocabulary (OOV) Problem; Rare word problem

    • Frequent words are translated correctly; rare words are not

    • In any corpus, word frequencies are exponentially unbalanced

    • Under-translation

    • Crucial information is left untranslated; premature generation

    [Figure: decoding "nobody expects the spanish inquisition"; at the last step the model assigns 'inquisition' probability 0.49 against 0.51 for the competing (end-of-sentence) token, so generation stops prematurely and "inquisition" is dropped]

  • Common Problems of NMT

    • Out-of-Vocabulary (OOV) Problem; Rare word problem

    • Frequent words are translated correctly; rare words are not

    • In any corpus, word frequencies are exponentially unbalanced

    • Under-translation

    • Crucial information is left untranslated; premature generation

  • Common Problems of NMT

    • Out-of-Vocabulary (OOV) Problem; Rare word problem

    • Copy Mechanisms

    • Char-level Encoder (breaks down for logographic scripts, e.g. Chinese)

    • Under-translation

    • Ensemble

    • Beam search

    • Coverage models

  • Copy Mechanism (Concept)

    [Figure: Seq2Seq with attention over the source "prof liebt kebab" ("prof loves kebab"); the decoder emits <unk> for the rare word, and the attention weights α_{t,1} = 0.1, α_{t,2} = 0.2, α_{t,3} = 0.7 point at "kebab", which is copied into the output]

    • Sees <unk> at step t

    • looks at the attention weights α_t

    • replaces <unk> with the source word f_{argmax_i α_{t,i}}

    1. CL2015015 [Luong et al.] Addressing the Rare Word Problem in Neural Machine Translation
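    A minimal sketch of this replace-the-<unk> step, assuming we already have the decoder's output tokens and, for each step t, the attention weights alpha[t] over the source (names are illustrative):

    def copy_unknowns(pred_tokens, alpha, src_tokens, unk="<unk>"):
        out = []
        for t, tok in enumerate(pred_tokens):
            if tok == unk:
                # replace <unk> with the source word f_{argmax_i alpha_{t,i}}
                i = max(range(len(src_tokens)), key=lambda j: alpha[t][j])
                out.append(src_tokens[i])
            else:
                out.append(tok)
        return out

    # copy_unknowns(["prof", "loves", "<unk>"],
    #               [[0.8, 0.1, 0.1], [0.1, 0.8, 0.1], [0.1, 0.2, 0.7]],
    #               ["prof", "liebt", "kebab"])  ->  ["prof", "loves", "kebab"]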

  • *Pointer-Generator Copy Mechanism (Concept)

    [Figure: the same Seq2Seq-with-attention example, extended with a generation probability p_g computed from the decoder state]

    • binary classifier switch p_g: use the decoder output, or use the copy mechanism

    • Sees <unk> at step t, or p_g([h^dec_t; context_t]) ≤ 0.5

    • looks at the attention weights α_t

    • replaces <unk> with the source word f_{argmax_i α_{t,i}}

    1. CL2016032 [Gulcehre et al.] Pointing the Unknown Words
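    A sketch of the switch for a single step, assuming `pg` is a trained linear layer over [h^dec_t; context_t] (PyTorch; the interface is illustrative):

    import torch

    def pointer_generator_step(pg, h_dec_t, context_t, gen_token, alpha_t, src_tokens):
        # p_g close to 1 -> trust the decoder's own word; otherwise copy from the source
        p_g = torch.sigmoid(pg(torch.cat([h_dec_t, context_t], dim=-1)))
        if gen_token == "<unk>" or p_g.item() <= 0.5:
            return src_tokens[int(alpha_t.argmax())]   # copy f_{argmax_i alpha_{t,i}}
        return gen_token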

  • *Pointer-Based Dictionary Fusion (Concept)

    • Sees <unk> at step t, or p_g([h^dec_t; context_t]) ≤ 0.5

    • looks at the attention weights α_t

    • replaces <unk> with the dictionary translation of the source word, dict(f_{argmax_i α_{t,i}})

    1. CL2019331 [Gū et al.] Pointer-based Fusion of Bilingual Lexicons into Neural Machine Translation
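    The dictionary-fusion variant changes only the last step: instead of copying the source word itself, look it up in a bilingual lexicon (a sketch; falling back to a plain copy when the lexicon has no entry is an assumption here, not necessarily what the paper does):

    def dictionary_fusion(src_word, lexicon):
        # dict(f_{argmax_i alpha_{t,i}}); fall back to copying the word itself
        return lexicon.get(src_word, src_word)

    # e.g. dictionary_fusion("liebt", {"liebt": "loves"}) -> "loves"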

  • Ensemble (Concept)

    • Similar to a voting mechanism, but with probabilities

    • multiple models with different parameters (usually from different checkpoints of the same training instance)

    • use the output with the highest probability across all models

  • Ensemble (Demo)

    [Figure: three checkpoints of the same Seq2Seq model (seq2seq_E047.pt, seq2seq_E048.pt, seq2seq_E049.pt) decode the same source; one predicts "prof loves kebab", one "prof loves cheese", one "prof loves pizza"]

    'kebab': 0.9, 'cheese': 0.05, 'pizza': 0.05, ...

    'kebab': 0.3, 'cheese': 0.55, 'pizza': 0.05, ...

    'kebab': 0.3, 'cheese': 0.05, 'pizza': 0.55, ...

    • 'kebab' has the highest probability across all models, so the ensemble outputs "kebab"
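    A sketch of the selection rule from this demo: stack each checkpoint's next-word distribution and emit the word with the single highest probability across all models (PyTorch; names are illustrative):

    import torch

    def ensemble_pick(probs_per_model):
        # probs_per_model: list of (vocab_size,) tensors, one per checkpoint
        stacked = torch.stack(probs_per_model)   # (n_models, vocab_size)
        flat = int(stacked.argmax())             # highest probability overall
        return flat % stacked.size(1)            # column index = word id

    # With the numbers above, 0.9 for 'kebab' wins, so 'kebab' is emitted.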

  • BeamSearch (Concept)

    • Consider multiple hypotheses at each step t; formulate decoding as a search over a tree

  • BeamSearch (Size=3) (Demo)

    [Figure: beam-search tree over the decoder states, beam size 3]

    • Step 1: candidate first words ("prof", "john", "eric", "terry") receive probabilities P = 0.5, 0.4, 0.09 and 0.01; only the top 3 hypotheses are kept

    • Step 2: each surviving hypothesis is expanded with candidate next words ("likes", "wants", "loves", "meets", "kicks", "is", "has", ...); again only the top 3 partial hypotheses are kept

    • Step 3: expansion continues ("cheese", "pizza", "kebab", ...) until every hypothesis is complete

    • The completed hypotheses are compared by probability: P('prof loves cheese' | F), P('eric loves kebab' | F), P('eric loves pizza' | F)

    • Output: "prof loves cheese"

  • BeamSearch (Concept)

    • Consider multiple hypotheses at each step t; formulate decoding as a search over a tree

    • Considers multiple complete translation hypotheses, but reduces the total search space from |V_E|^m to batchSize × m
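    A minimal beam-search sketch over a generic `step(prefix)` function that returns a {word: probability} dict for the next position (this interface is an assumption, not the homework's API):

    import math

    def beam_search(step, beam_size=3, max_len=20, eos="</s>"):
        beams = [([], 0.0)]                          # (prefix, log-probability)
        for _ in range(max_len):
            candidates = []
            for prefix, logp in beams:
                if prefix and prefix[-1] == eos:     # finished hypotheses carry over
                    candidates.append((prefix, logp))
                    continue
                for word, p in step(prefix).items():
                    candidates.append((prefix + [word], logp + math.log(p)))
            # keep only the top `beam_size` hypotheses at each step
            beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_size]
            if all(pre and pre[-1] == eos for pre, _ in beams):
                break
        return beams[0][0]                           # best complete hypothesis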

  • Self-Attentive NMT (Review)

    [Figure: attention recap; the encoder representations of "I like cheese" are scored against the decoder RNN state, weighted by α_1, α_2, α_3, and summed into the context that is combined with h_t]

    • Aggregating src information h^enc_{1:|F|}, with h^dec_t as the query

  • Self-Attentive NMT (Review)

    • Aggregating src information h^enc_{1:|F|}, with h^dec_t as the query

    • h^enc_i: aggregating information from f_{1:|F|}, with f_i as the query

    • h^dec_t: likewise aggregating the target-side history, with the current decoder position as the query

    • Can we replace the RNN with attention?

  • Self-Attentive NMT (Concept)

    • Transformer¹

    • Encoder-Decoder

    • No RNN -> all RNNs in Seq2Seq replaced by identical self-attention blocks

    • Multi-headed attention blocks*

    • read the paper for more information

    1. CL2017244 [Vaswani et al.] Attention Is All You Need
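    A sketch of the (single-head) scaled dot-product self-attention that replaces the RNN; see the paper for the multi-headed version (shapes and names here are illustrative):

    import math
    import torch
    import torch.nn.functional as F

    def self_attention(x, w_q, w_k, w_v):
        # x: (seq_len, d_model); w_q, w_k, w_v: (d_model, d_k) projections
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        scores = q @ k.transpose(0, 1) / math.sqrt(q.size(-1))
        alpha = F.softmax(scores, dim=-1)            # every position attends to all positions
        return alpha @ v                             # (seq_len, d_k)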

  • Self-Attentive NMT (Concept)

    • Does everything Seq2Seq does

    • BERT²: Bidirectional Encoder Representations from Transformers

    1. CL2017244 [Vaswani et al.] Attention Is All You Need
    2. CL2018460 [Devlin et al.] BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding

  • Beyond NMT (Concept)

    • NMT weaknesses

    • Longer src sentences -> SMT does it better; augment attention?

    • Growing lexicon / terminologies -> Pointer-based Dictionary Fusion

    • Sensitive domains -> offline computing, human-involved systems to ensure accuracy

    • Mobile platform deployment -> model compression


  • Beyond NMT (Concept)

    [Slide: "HYBRID" approaches contrasted with the slogans "PURISM FOREVER!", "MAKE NMT GREAT AGAIN", "ATTENTION IS ALL YOU NEED", "HYBRIDS ARE DUMB"]

  • My work here is done, thank you.

