Gated RNN & Sequence Generation Hung-yi Lee 李宏毅


Page 1: Gated RNN & Sequence Generation

Gated RNN & Sequence Generation

Hung-yi Lee

李宏毅

Page 2: Gated RNN & Sequence Generation

Outline

• RNN with Gated Mechanism

• Sequence Generation

• Conditional Sequence Generation

• Tips for Generation

Page 3: Gated RNN & Sequence Generation

RNN with Gated Mechanism

Page 4: Gated RNN & Sequence Generation

Recurrent Neural Network

• Given a function $f$: $h', y = f(h, x)$

[Figure: the same cell $f$ unrolled over time: $f(h^0, x^1) = h^1, y^1$; $f(h^1, x^2) = h^2, y^2$; $f(h^2, x^3) = h^3, y^3$; ……]

No matter how long the input/output sequence is, we only need one function f

h and h’ are vectors with the same dimension
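To make the recursion concrete, here is a minimal NumPy sketch of one unrolled pass; the tanh cell, the softmax output, and all names and sizes are illustrative assumptions rather than the lecture's exact parameterization:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def f(h, x, Wh, Wi, Wo):
    """One RNN step: h', y = f(h, x); h and h' have the same dimension."""
    h_new = np.tanh(Wh @ h + Wi @ x)   # next hidden state h'
    y = softmax(Wo @ h_new)            # output distribution at this step
    return h_new, y

# The same f is applied at every step, no matter how long the sequence is.
dim_h, dim_x, dim_y = 4, 3, 5
rng = np.random.default_rng(0)
Wh = rng.normal(size=(dim_h, dim_h))
Wi = rng.normal(size=(dim_h, dim_x))
Wo = rng.normal(size=(dim_y, dim_h))

h = np.zeros(dim_h)                    # h0
for x in rng.normal(size=(6, dim_x)):  # x1, x2, ..., x6
    h, y = f(h, x, Wh, Wi, Wo)
```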

Page 5: Gated RNN & Sequence Generation

Deep RNN

[Figure: two stacked RNN layers; layer $f_1$ maps the $x^t$ to the $y^t$ with states $h^t$, and layer $f_2$ takes the $y^t$ as input and produces $c^t$ with states $b^t$.]

$h', y = f_1(h, x)$   $b', c = f_2(b, y)$   ……

Page 6: Gated RNN & Sequence Generation

Bidirectional RNN

[Figure: $f_1$ reads $x^1, x^2, x^3$ in the forward direction, producing $a^1, a^2, a^3$; $f_2$ reads the same inputs in the backward direction, producing $c^1, c^2, c^3$; $f_3$ combines each pair into $y^1, y^2, y^3$.]

$h', a = f_1(h, x)$   $b', c = f_2(b, x)$

$y = f_3(a, c)$

Page 7: Gated RNN & Sequence Generation

Naïve RNN

• Given a function $f$: $h', y = f(h, x)$

[Figure: one naïve RNN cell maps $(h, x)$ to $(h', y)$.]

Ignoring biases here:

$h' = \sigma(W_h h + W_i x)$

$y = \mathrm{softmax}(W_o h')$

Page 8: Gated RNN & Sequence Generation

LSTM

• $c$ changes slowly: $c^t$ is $c^{t-1}$ plus something

• $h$ changes faster: $h^t$ and $h^{t-1}$ can be very different

[Figure: a naïve RNN cell passes only $h^{t-1}$ between time steps; an LSTM cell passes both $c^{t-1}$ and $h^{t-1}$.]

Page 9: Gated RNN & Sequence Generation

From $x^t$ and $h^{t-1}$, the LSTM computes the block input and three gates:

$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^i = \sigma\left(W^i \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^f = \sigma\left(W^f \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

$z^o = \sigma\left(W^o \begin{bmatrix} x^t \\ h^{t-1} \end{bmatrix}\right)$

Page 10: Gated RNN & Sequence Generation

"Peephole" variant: $c^{t-1}$ is also fed into the computation,

$z = \tanh\left(W \begin{bmatrix} x^t \\ h^{t-1} \\ c^{t-1} \end{bmatrix}\right)$, where the block of $W$ acting on $c^{t-1}$ is diagonal.

$z^i$, $z^f$, $z^o$ are obtained in the same way.

Page 11: Gated RNN & Sequence Generation

[Figure: the LSTM cell combines $z$, $z^i$, $z^f$, $z^o$ with element-wise products $\odot$ to produce $c^t$, $h^t$, and $y^t$.]

$c^t = z^f \odot c^{t-1} + z^i \odot z$

$h^t = z^o \odot \tanh(c^t)$

$y^t = \sigma(W' h^t)$
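A minimal NumPy sketch of these update equations; the concatenation of $x^t$ and $h^{t-1}$ follows the slides, while the names and the omission of biases are simplifying assumptions:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def lstm_step(c_prev, h_prev, x, W, Wi, Wf, Wo, W_out):
    """One LSTM step following the equations above (biases omitted)."""
    xh = np.concatenate([x, h_prev])
    z  = np.tanh(W  @ xh)        # block input
    zi = sigmoid(Wi @ xh)        # input gate
    zf = sigmoid(Wf @ xh)        # forget gate
    zo = sigmoid(Wo @ xh)        # output gate
    c = zf * c_prev + zi * z     # c changes slowly: c_t is c_{t-1} plus something
    h = zo * np.tanh(c)          # h changes faster
    y = sigmoid(W_out @ h)       # y_t = sigma(W' h_t)
    return c, h, y
```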

Page 12: Gated RNN & Sequence Generation

LSTM

[Figure: the LSTM unrolled over two time steps; the $c^t$ and $h^t$ produced at step $t$ are the recurrent inputs of step $t+1$.]

Page 13: Gated RNN & Sequence Generation

GRU

[Figure: from $x^t$ and $h^{t-1}$, a reset gate $r$ and an update gate $z$ are computed; the candidate $h'$ is computed from $x^t$ and $r \odot h^{t-1}$.]

$h^t = z \odot h^{t-1} + (1 - z) \odot h'$
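For comparison with the LSTM sketch above, here is a GRU step under the same illustrative assumptions (standard GRU gating, biases omitted); note the reset gate is applied to $h^{t-1}$ before computing the candidate:

```python
import numpy as np

def sigmoid(v):
    return 1.0 / (1.0 + np.exp(-v))

def gru_step(h_prev, x, Wr, Wz, Wc):
    """One GRU step: h_t = z * h_{t-1} + (1 - z) * h'."""
    xh = np.concatenate([x, h_prev])
    r = sigmoid(Wr @ xh)    # reset gate
    z = sigmoid(Wz @ xh)    # update gate
    h_cand = np.tanh(Wc @ np.concatenate([x, r * h_prev]))  # candidate h'
    return z * h_prev + (1 - z) * h_cand
```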

Page 14: Gated RNN & Sequence Generation

LSTM: A Search Space Odyssey

Page 15: Gated RNN & Sequence Generation

LSTM: A Search Space Odyssey

• Standard LSTM works well

• Simplified LSTM: couple the input and forget gates, remove the peephole

• The forget gate is critical for performance

• The activation function of the output gate is critical

Page 16: Gated RNN & Sequence Generation

An Empirical Exploration of Recurrent Network Architectures

Importance: forget > input > output

A large bias for the forget gate is helpful

LSTM-f/i/o: removing the forget/input/output gate

LSTM-b: a large forget-gate bias

Page 17: Gated RNN & Sequence Generation

An Empirical Exploration of Recurrent Network Architectures

Page 18: Gated RNN & Sequence Generation

Neural Architecture Search with Reinforcement Learning

LSTM from reinforcement learning

Page 19: Gated RNN & Sequence Generation

Sequence Generation

Page 20: Gated RNN & Sequence Generation

Generation

• Sentences are composed of characters/words

• Generate a character/word at each time step with an RNN

[Figure: one cell $f$ maps $(h, x)$ to $(h', y)$.]

x: the token generated at the last time step, represented by 1-of-N encoding, e.g. 1 0 0 0 0 …… over the vocabulary 你 我 他 是 很 ……

y: a distribution over the tokens; sample from the distribution to generate a token, e.g. 0 0 0 0 …… 0.7 0.3

Page 21: Gated RNN & Sequence Generation

Generation

• Sentences are composed of characters/words

• Generate a character/word at each time step with an RNN

[Figure: generation unrolled from <BOS>: the token sampled at each step (e.g. 床, then 前, then 明) is fed back as the next input.]

$y^1$: $P(w \mid \text{<BOS>})$, sample 床

$y^2$: $P(w \mid \text{<BOS>}, 床)$, sample 前

$y^3$: $P(w \mid \text{<BOS>}, 床, 前)$, sample 明

Until <EOS> is generated
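A sketch of this sampling loop, assuming a hypothetical trained step function `f` that returns the next state and a normalized distribution over the vocabulary; the token ids and names are illustrative:

```python
import numpy as np

def generate(f, h0, bos_id, eos_id, vocab_size, max_len=50, seed=None):
    """Sample one token per step and feed it back; stop when <EOS> is sampled."""
    rng = np.random.default_rng(seed)
    h, token, output = h0, bos_id, []
    for _ in range(max_len):
        x = np.eye(vocab_size)[token]        # 1-of-N encoding of the last token
        h, y = f(h, x)                       # y: distribution over the tokens
        token = rng.choice(vocab_size, p=y)  # sample from the distribution
        if token == eos_id:
            break
        output.append(token)
    return output
```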

Page 22: Gated RNN & Sequence Generation

Generation

• Training

Training data: 春 眠 不 覺 曉

[Figure: teacher forcing: the inputs are <BOS>, 春, 眠 (the reference tokens) and the targets are 春, 眠, 不.]

Minimize the cross-entropy between each output distribution and the corresponding reference token.
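A sketch of one teacher-forced training pass computing the summed cross-entropy $C = \sum_t C_t$; in practice an autodiff framework would backpropagate through this loss, and the `<BOS>` id and step function `f` are illustrative assumptions:

```python
import numpy as np

def train_loss(f, h0, reference_ids, vocab_size, bos_id=0):
    """Teacher forcing: inputs are the reference tokens shifted right by <BOS>;
    the loss is the cross-entropy summed over all steps, C = sum_t C_t."""
    inputs = [bos_id] + list(reference_ids[:-1])
    h, loss = h0, 0.0
    for inp, target in zip(inputs, reference_ids):
        x = np.eye(vocab_size)[inp]  # 1-of-N encoding of the reference token
        h, y = f(h, x)               # y: predicted distribution
        loss += -np.log(y[target])   # cross-entropy C_t at this step
    return loss
```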

Page 23: Gated RNN & Sequence Generation

Generation

• Images are composed of pixels

• Generate a pixel at each time step with an RNN

Consider an image as a sentence of pixel colors: blue red yellow gray ……

Train an RNN based on the "sentences"

[Figure: the same sampling loop as for text, now producing pixel colors: red, blue, green, ……]

Page 24: Gated RNN & Sequence Generation

Generation

• Images are composed of pixels

[Figure: examples of generated 3 × 3 images.]

Page 25: Gated RNN & Sequence Generation

Conditional Sequence Generation

Page 26: Gated RNN & Sequence Generation

Conditional Generation

• We don't want to simply generate some random sentences.

• Generate sentences based on conditions:

Caption generation. Given condition: an image. Output: "A young girl is dancing."

Chat-bot. Given condition: "Hello". Output: "Hello. Nice to see you."

Page 27: Gated RNN & Sequence Generation

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of the RNN generator

Image caption generation:

[Figure: a CNN encodes the input image into a vector; the vector is fed to the RNN generator, which outputs "A", "woman", ……, ". (period)" starting from <BOS>.]

Page 28: Gated RNN & Sequence Generation

Conditional Generation

• Represent the input condition as a vector, and consider the vector as the input of the RNN generator

• E.g. machine translation / chat-bot

[Figure: the encoder reads 機器學習 ("machine learning"); its final state, carrying the information of the whole sentence, is fed to the decoder, which outputs "machine", "learning", ". (period)".]

Encoder and decoder are jointly trained.

Sequence-to-sequence learning

Page 29: Gated RNN & Sequence Generation

Conditional Generation

[Figure: the same user utterance "Hi" can deserve different responses depending on what was said before, e.g. the histories "U: Hi / M: Hi" versus "U: Hi / M: Hello".]

Need to consider longer context during chatting.

Serban, Iulian V., Alessandro Sordoni, Yoshua Bengio, Aaron Courville, and Joelle Pineau, "Building End-To-End Dialogue Systems Using Generative Hierarchical Neural Network Models", 2015

https://www.youtube.com/watch?v=e2MpOmyQJw4

Page 30: Gated RNN & Sequence Generation

Dynamic Conditional Generation

[Figure: the encoder produces hidden states $h^1, h^2, h^3, h^4$ over 機器學習; at each decoding step a different context vector ($c^1$, then $c^2$, then $c^3$) is computed from them and fed to the decoder, which outputs "machine", "learning", ". (period)".]

Page 31: Gated RNN & Sequence Generation

Dynamic Conditional Generation

[Figure: contrast with the basic encoder-decoder, where a single vector carrying the information of the whole sentence conditions every decoder step; here the decoder receives $c^1$, $c^2$, $c^3$ at the corresponding steps.]

Page 32: Gated RNN & Sequence Generation

Machine Translation

• Attention-based model

[Figure: the decoder's initial state $z^0$ is matched against each encoder state $h^1, h^2, h^3, h^4$ (over 機器學習); $\mathrm{match}(z^0, h^1) = \alpha_0^1$.]

What is match? Design it by yourself, e.g.:

• cosine similarity of $z$ and $h$

• a small NN whose input is $z$ and $h$ and whose output is a scalar

• $\alpha = h^T W z$

Any parameters of match are jointly learned with the other parts of the network.

Page 33: Gated RNN & Sequence Generation

Machine Translation

• Attention-based model

[Figure: the scores $\alpha_0^1, \alpha_0^2, \alpha_0^3, \alpha_0^4$ pass through a softmax, giving $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4 = 0.5, 0.5, 0.0, 0.0$.]

$c^0 = \sum_i \hat{\alpha}_0^i h^i = 0.5 h^1 + 0.5 h^2$

$c^0$ is the decoder input; from it, the decoder state $z^1$ outputs "machine".
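A compact sketch of one attention step, using the bilinear match $\alpha = h^T W z$ from the options above; all names are illustrative:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def attend(z, H, W):
    """One attention step: score every encoder state, normalize, weighted-sum.

    z: decoder state; H: encoder states h^1..h^N stacked as rows; W: learned.
    """
    alpha = np.array([h @ W @ z for h in H])  # match(z, h) = h^T W z
    alpha_hat = softmax(alpha)                # normalized attention weights
    c = alpha_hat @ H                         # context c = sum_i alpha_hat_i h^i
    return c, alpha_hat
```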

Page 34: Gated RNN & Sequence Generation

Machine Translation

• Attention-based model

[Figure: at the next step, the new decoder state $z^1$ is matched against $h^1, h^2, h^3, h^4$ again, giving $\alpha_1^1$, and so on.]

Page 35: Gated RNN & Sequence Generation

Machine Translation

• Attention-based model

[Figure: the scores $\alpha_1^1, \alpha_1^2, \alpha_1^3, \alpha_1^4$ pass through a softmax, giving $\hat{\alpha}_1^1, \hat{\alpha}_1^2, \hat{\alpha}_1^3, \hat{\alpha}_1^4 = 0.0, 0.0, 0.5, 0.5$.]

$c^1 = \sum_i \hat{\alpha}_1^i h^i = 0.5 h^3 + 0.5 h^4$

$c^1$ is the input of the next decoder step, whose state $z^2$ outputs "learning".

Page 36: Gated RNN & Sequence Generation

Machine Translation

• Attention-based model

[Figure: $z^2$ is matched against the encoder states again ($\alpha_2^1$, ……), and so on.]

The same process repeats until <EOS> is generated.

Page 37: Gated RNN & Sequence Generation

Speech Recognition

William Chan, Navdeep Jaitly, Quoc V. Le, Oriol Vinyals, "Listen, Attend and Spell", ICASSP, 2016

Page 38: Gated RNN & Sequence Generation

Image Caption Generation

[Figure: a CNN turns the input image into a grid of filter outputs, i.e. a vector for each region; match scores the decoder state $z^0$ against each region vector, e.g. 0.7 for one region.]

Page 39: Gated RNN & Sequence Generation

Image Caption Generation

[Figure: with attention weights 0.7, 0.1, 0.1, 0.1, 0.0, 0.0 over the region vectors, their weighted sum is fed to the decoder, whose state $z^1$ generates Word 1.]

Page 40: Gated RNN & Sequence Generation

Image Caption Generation

[Figure: at the next step the weights are 0.0, 0.8, 0.2, 0.0, 0.0, 0.0; the new weighted sum is fed in, and $z^2$ generates Word 2.]

Page 41: Gated RNN & Sequence Generation

Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

Page 42: Gated RNN & Sequence Generation

Image Caption Generation

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, “Show, Attend and Tell: Neural Image Caption Generation with Visual Attention”, ICML, 2015

Page 43: Gated RNN & Sequence Generation

Li Yao, Atousa Torabi, Kyunghyun Cho, Nicolas Ballas, Christopher Pal, Hugo Larochelle, Aaron Courville, “Describing Videos by Exploiting Temporal Structure”, ICCV, 2015

Page 44: Gated RNN & Sequence Generation

Question Answering

• Given a document and a query, output an answer

• bAbI: the answer is a word

• https://research.fb.com/downloads/babi/

• SQuAD: the answer is a sequence of words (in the input document)

• https://rajpurkar.github.io/SQuAD-explorer/

• MS MARCO: the answer is a sequence of words

• http://www.msmarco.org

• MovieQA: Multiple choice question (output a number)

• http://movieqa.cs.toronto.edu/home/

Page 45: Gated RNN & Sequence Generation

Memory Network

[Figure: the document is stored as sentence vectors $x^1, x^2, x^3, \ldots, x^N$; the query vector $q$ is matched against each $x^n$, giving attention weights $\alpha_1, \ldots, \alpha_N$; the extracted information is fed to a DNN that outputs the answer.]

Extracted information: $\sum_{n=1}^{N} \alpha_n x^n$

Sentence-to-vector encoding can be jointly trained.

Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, Rob Fergus, “End-To-End Memory Networks”, NIPS, 2015

Page 46: Gated RNN & Sequence Generation

Memory Network

[Figure: a variant with two jointly learned vectors per sentence: $x^n$ is matched against the query $q$ to give $\alpha_n$, while $h^n$ is what gets extracted; the information fed to the DNN is $\sum_{n=1}^{N} \alpha_n h^n$.]

Hopping: the extracted information can serve as a new query, and the process repeats.

Page 47: Gated RNN & Sequence Generation

Memory Network

[Figure: hopping: the query $q$ drives a compute-attention / extract-information pass over the memory; the extracted result is the query for a second compute-attention / extract-information pass; finally a DNN outputs the answer $a$.]
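A sketch of this multi-hop loop with a dot-product match; End-To-End Memory Networks additionally use per-hop embedding matrices, so treat the names and the two-hop default as illustrative:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def hop_answer(q, X, H, n_hops=2):
    """Multi-hop memory network: X scores the attention, H is what is extracted."""
    for _ in range(n_hops):
        alpha = softmax(X @ q)  # compute attention: query against each sentence
        q = alpha @ H           # extract information; it becomes the next query
    return q                    # in the full model, a DNN maps this to the answer
```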

Page 48: Gated RNN & Sequence Generation

Reading Comprehension

• End-To-End Memory Networks. S. Sukhbaatar, A. Szlam, J. Weston, R. Fergus. NIPS, 2015.

The position of the reading head:

Keras has an example: https://github.com/fchollet/keras/blob/master/examples/babi_memnn.py

Page 49: Gated RNN & Sequence Generation

Wei Fang, Juei-Yang Hsu, Hung-yi Lee, Lin-Shan Lee, "Hierarchical Attention Model for Improved Machine Comprehension of Spoken Content", SLT, 2016

Page 50: Gated RNN & Sequence Generation

Neural Turing Machine

• von Neumann architecture

https://www.quora.com/How-does-the-Von-Neumann-architecture-provide-flexibility-for-program-development

A Neural Turing Machine not only reads from memory, but also modifies the memory through attention.

Page 51: Gated RNN & Sequence Generation

Neural Turing Machine

[Figure: memory cells $m_0^1, m_0^2, m_0^3, m_0^4$ with attention weights $\hat{\alpha}_0^1, \hat{\alpha}_0^2, \hat{\alpha}_0^3, \hat{\alpha}_0^4$; the read vector $r^0$ and the input $x^1$ feed the controller $f$ (initial state $h^0$), which outputs $y^1$.]

Retrieval process: $r^0 = \sum_i \hat{\alpha}_0^i m_0^i$

Page 52: Gated RNN & Sequence Generation

Neural Turing Machine

The controller $f$ also outputs three vectors $k^1$, $e^1$, $a^1$. The key $k^1$ updates the attention:

$\alpha_1^i = \cos(m_0^i, k^1)$

$\hat{\alpha}_1^i$: softmax over the $\alpha_1^i$

Page 53: Gated RNN & Sequence Generation

Neural Turing Machine

The erase vector $e^1$ (elements between 0 and 1) and the add vector $a^1$ update each memory cell (element-wise products):

$m_1^i = m_0^i - \hat{\alpha}_1^i \, e^1 \odot m_0^i + \hat{\alpha}_1^i \, a^1$
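The read and write equations in NumPy, with cosine addressing over the rows of a memory matrix; the shapes and names are illustrative:

```python
import numpy as np

def softmax(v):
    e = np.exp(v - v.max())
    return e / e.sum()

def address(M, k):
    """Content-based addressing: alpha_i = cos(m_i, k), then softmax."""
    sims = M @ k / (np.linalg.norm(M, axis=1) * np.linalg.norm(k) + 1e-8)
    return softmax(sims)

def read(M, alpha_hat):
    """r = sum_i alpha_hat_i m_i."""
    return alpha_hat @ M

def write(M, alpha_hat, e, a):
    """m_i <- m_i - alpha_hat_i * (e * m_i) + alpha_hat_i * a, for every cell."""
    return M - np.outer(alpha_hat, e) * M + np.outer(alpha_hat, a)
```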

Page 54: Gated RNN & Sequence Generation

Neural Turing Machine

[Figure: the process repeats over time: the write step turns the $m_0^i$ into $m_1^i$, the next read gives $r^1$, the controller consumes $r^1$ and $x^2$ to produce $y^2$ and new attention $\hat{\alpha}_2^i$, which writes the $m_2^i$, and so on.]

Page 55: Gated RNN & Sequence Generation

Neural Turing Machine

Wei-Jen Ko, Bo-Hsiang Tseng, Hung-yi Lee, “Recurrent Neural Network based Language Modeling with Controllable External Memory”, ICASSP, 2017

Page 56: Gated RNN & Sequence Generation

Stack RNN

Armand Joulin, Tomas Mikolov, "Inferring Algorithmic Patterns with Stack-Augmented Recurrent Nets", arXiv preprint, 2015

[Figure: at each step, $f$ reads $x^t$ and the top of the stack, and outputs $y^t$, the information to store, and a distribution over the actions Push, Pop, Nothing (e.g. 0.7, 0.2, 0.1); the new stack is the mixture 0.7 × (stack after Push) + 0.2 × (stack after Pop) + 0.1 × (unchanged stack).]
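A sketch of this differentiable stack update; the fixed depth and the zero row filled in by Pop are illustrative choices:

```python
import numpy as np

def soft_stack_update(stack, p_push, p_pop, p_nothing, store):
    """Blend the three possible next stacks by the action probabilities.

    stack: (depth, dim) array with row 0 on top; store: (dim,) vector to push.
    """
    pushed = np.vstack([store, stack[:-1]])                    # push: shift down
    popped = np.vstack([stack[1:], np.zeros_like(stack[:1])])  # pop: shift up
    return p_push * pushed + p_pop * popped + p_nothing * stack
```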

Page 57: Gated RNN & Sequence Generation

Tips for Generation

Page 58: Gated RNN & Sequence Generation

Attention

[Figure: attention weights $\alpha_t^i$ over video frames, indexed by component $i$ and generation time $t$, while generating the words $w_1 w_2 w_3 w_4$. Bad attention concentrates on the same frame at several steps, producing captions like "(woman) (woman) ……" with no "cooking".]

Good attention: each input component should get approximately the same total attention weight over the whole generation.

E.g. regularization term: $\sum_i \left(\tau - \sum_t \alpha_t^i\right)^2$, where the inner sum runs over the generation for each component $i$.

Kelvin Xu, Jimmy Ba, Ryan Kiros, Kyunghyun Cho, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, Yoshua Bengio, "Show, Attend and Tell: Neural Image Caption Generation with Visual Attention", ICML, 2015
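The regularization term in code, given the full matrix of attention weights (rows indexed by generation step $t$, columns by component $i$); $\tau$ is a hyperparameter:

```python
import numpy as np

def attention_penalty(alpha, tau=1.0):
    """sum_i (tau - sum_t alpha[t, i])^2: every component i should receive
    roughly tau total attention over the whole generation."""
    total_per_component = alpha.sum(axis=0)  # sum over time, one value per i
    return np.sum((tau - total_per_component) ** 2)
```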

Page 59: Gated RNN & Sequence Generation

Mismatch between Train and Test

• Training

Reference: A B B (given the condition)

[Figure: at each training step the input is the previous reference token (<BOS>, A, B) and the target is the next reference token (A, B, B).]

$C = \sum_t C_t$: minimize the cross-entropy of each component.

Page 60: Gated RNN & Sequence Generation

Mismatch between Train and Test

• Generation

[Figure: at test time each step's sampled output is fed back as the next input; we do not know the reference.]

Testing: the inputs are the outputs of the last time step.

Training: the inputs are the reference.

This mismatch is called exposure bias.

Page 61: Gated RNN & Sequence Generation

[Figure: the tree of all possible output sequences over {A, B}; training only ever visits the states along the reference path, so the rest of the tree is never explored.]

One step wrong, may be totally wrong: the later states were never explored during training. 一步錯，步步錯 ("one wrong step, and every step after it goes wrong").

Page 62: Gated RNN & Sequence Generation

Modifying Training Process?

Reference: B A B

[Figure: feed the model's own outputs back as inputs during training, so that training is matched to testing.]

In practice, it is hard to train in this way: when we try to decrease the loss for both steps 1 and 2, correcting the output of step 1 changes the input that step 2 has been trained on, so the two updates interfere.

Page 63: Gated RNN & Sequence Generation

Scheduled Sampling

[Figure: at each step, the next input is taken either from the model (its own last output) or from the reference, decided by a coin flip per step.]

Page 64: Gated RNN & Sequence Generation

Scheduled Sampling

• Caption generation on MSCOCO

                          BLEU-4   METEOR   CIDEr
Always from reference      28.8     24.2     89.5
Always from model          11.2     15.7     49.7
Scheduled sampling         30.6     24.3     92.1

Samy Bengio, Oriol Vinyals, Navdeep Jaitly, Noam Shazeer, Scheduled Sampling for Sequence Prediction with Recurrent Neural Networks, arXiv preprint, 2015
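A sketch of a scheduled-sampling training step; per-step coin flips follow the idea above, while the decay schedule for `p_reference` and all names are illustrative:

```python
import numpy as np

def scheduled_sampling_loss(f, h0, reference_ids, vocab_size,
                            p_reference, seed=None, bos_id=0):
    """Like teacher forcing, but each input is the model's own sample with
    probability 1 - p_reference; decay p_reference as training proceeds."""
    rng = np.random.default_rng(seed)
    h, loss, prev = h0, 0.0, bos_id
    for target in reference_ids:
        h, y = f(h, np.eye(vocab_size)[prev])   # y: distribution over tokens
        loss += -np.log(y[target])              # cross-entropy vs. the reference
        if rng.random() < p_reference:
            prev = target                       # next input from the reference
        else:
            prev = rng.choice(vocab_size, p=y)  # next input from the model
    return loss
```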

Page 65: Gated RNN & Sequence Generation

Beam Search

[Figure: a tree of output sequences over {A, B} with branch probabilities; picking the locally best branch at each step gives 0.6 × 0.6 × 0.6 ≈ 0.22, while the green path takes a 0.4 branch first and then 0.9 × 0.9, totalling ≈ 0.32.]

The green path has a higher score, but it is not possible to check all the paths.

Page 66: Gated RNN & Sequence Generation

Beam Search

[Figure: the same tree; at every level only the best partial sequences are kept and expanded.]

Keep the several best paths at each step.

Beam size = 2

Page 67: Gated RNN & Sequence Generation

Beam Search

https://github.com/tensorflow/tensorflow/issues/654#issuecomment-169009989

The beam size is 3 in this example.
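A sketch of beam search over a hypothetical step function `f`; keeping exactly `beam_size` candidates and ranking by summed log-probability are the standard choices, with length normalization and other refinements omitted:

```python
import numpy as np

def beam_search(f, h0, bos_id, eos_id, vocab_size, beam_size=3, max_len=20):
    """Keep the beam_size best partial sequences (by log-probability) per step."""
    beams, finished = [(0.0, [bos_id], h0)], []
    for _ in range(max_len):
        candidates = []
        for logp, toks, h in beams:
            h_new, y = f(h, np.eye(vocab_size)[toks[-1]])
            for w in np.argsort(y)[-beam_size:]:  # best continuations
                candidates.append((logp + np.log(y[w]), toks + [int(w)], h_new))
        candidates.sort(key=lambda b: b[0], reverse=True)
        beams = []
        for cand in candidates[:beam_size]:       # keep the best paths
            (finished if cand[1][-1] == eos_id else beams).append(cand)
        if not beams:
            break
    return max(finished + beams, key=lambda b: b[0])[1]
```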

Page 68: Gated RNN & Sequence Generation

Better Idea?

[Figure: two decoders compared: one feeds back the sampled token, the other feeds back the whole output distribution.]

The problem with feeding the distribution: words with similar high scores blur together. If "I" ≈ "you" at the first step and "am" ≈ "are" at the second, then besides "I am ……" and "You are ……" the model can produce "I are ……" or "You am ……".

Page 69: Gated RNN & Sequence Generation

Object level v.s. Component level

• Minimizing the error defined at the component level is not equivalent to improving the generated objects

Cross-entropy of each step: $C = \sum_t C_t$

Ref: The dog is running fast

Generated: "A cat a a a", "The dog is is fast", "The dog is running fast"

Optimize an object-level criterion instead of the component-level cross-entropy. Object-level criterion: $R(y, \hat{y})$, where $y$ is the generated utterance and $\hat{y}$ is the ground truth.

Gradient descent? $R$ is computed on discrete tokens, so it cannot be optimized directly by gradient descent.

Page 70: Gated RNN & Sequence Generation

Reinforcement learning?

[Figure: an agent playing a video game. Start with observation $s_1$; take action $a_1$: "right", obtain reward $r_1 = 0$; observe $s_2$; take action $a_2$: "fire" (killing an alien), obtain reward $r_2 = 5$; observe $s_3$; ……]

Page 71: Gated RNN & Sequence Generation

Reinforcement learning?

[Figure: the decoder as an agent: its hidden state is the observation, the vocabulary {A, B} is the action set, and each generated token is an action taken; the action we take influences the observation at the next step. The intermediate rewards are $r = 0$; only at the end of the sequence is the reward obtained: $R(\text{"BAA"}, \text{reference})$.]

Marc'Aurelio Ranzato, Sumit Chopra, Michael Auli, Wojciech Zaremba, "Sequence Level Training with Recurrent Neural Networks", ICLR, 2016
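A sketch of the simplest policy-gradient (REINFORCE) view of this setup: sample a sequence, receive one terminal reward, and weight every step's log-probability by it. The actual paper (MIXER) adds a learned baseline and a curriculum, so everything here is illustrative:

```python
import numpy as np

def sample_episode(f, h0, vocab_size, reward_fn, reference,
                   max_len=20, seed=None, bos_id=0):
    """Generate a whole sequence (one 'episode'), then score it once.

    Returns the terminal reward R and the per-step log-probabilities; the
    policy gradient weights grad log p(a_t) by R: maximize R * sum(logps).
    """
    rng = np.random.default_rng(seed)
    h, token, logps, sampled = h0, bos_id, [], []
    for _ in range(max_len):
        h, y = f(h, np.eye(vocab_size)[token])  # observation -> action dist.
        token = rng.choice(vocab_size, p=y)     # action taken (a token)
        logps.append(np.log(y[token]))          # log p(a_t | s_t)
        sampled.append(token)
    R = reward_fn(sampled, reference)           # e.g. R("BAA", reference)
    return R, logps
```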

Page 72: Gated RNN & Sequence Generation

Concluding Remarks

• RNN with Gated Mechanism

• Sequence Generation

• Conditional Sequence Generation

• Tips for Generation