SD Study: RNN & LSTM
2016/11/10, Seitaro Shinagawa
Slides: isw3.naist.jp/.../student/2015/seitaro-s/161110SDstudy.pdf

Page 1:

SD Study: RNN & LSTM
2016/11/10
Seitaro Shinagawa

Page 2:

This presentation is intended for people who already understand simple neural-network architectures such as feed-forward networks.

Page 3:

I will introduce the LSTM: how to use it, and some tips for Chainer.

Page 4:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ("Understanding LSTM, together with recent trends", http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

1. RNN to LSTM

[Figure: a simple RNN, drawn as input layer, middle (hidden) layer, and output layer.]

Page 5:

FAQ with LSTM beginner students

A-san: I hear the LSTM is a kind of RNN, but the LSTM looks like a different architecture…
Neural bear: They have the same architecture! Please follow me!

[Figure: the RNN A-san often sees (a single block with input x_t, hidden state h_t, output y_t) next to the LSTM A-san often sees (three chained LSTM blocks with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3). Are these the same or different?]

Page 6:

Introducing the LSTM figure, starting from the RNN

[Figure: the single RNN block with input x_t, hidden state h_t, and output y_t.]

Page 7:

Introducing the LSTM figure, starting from the RNN

Unroll it along the time axis.

[Figure: the same RNN block, about to be unrolled over time.]

Page 8:

Introducing the LSTM figure, starting from the RNN

Unrolled along the time axis:

[Figure: the unrolled RNN; h_0 feeds h_1, h_2, h_3 with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3.]

A-san: Oh, I often see this for the RNN!

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{sigmoid}(W_{hy} h_t)$$

So this figure focuses on the variables and shows the relationships between them.
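As a concrete reference, here is a minimal NumPy sketch of this recurrence (the sizes, initialization, and variable names are assumptions for illustration, not from the slides):

    import numpy as np

    D, H, O = 8, 16, 4                      # assumed input/hidden/output sizes
    W_xh = np.random.randn(H, D) * 0.1      # input-to-hidden weights
    W_hh = np.random.randn(H, H) * 0.1      # hidden-to-hidden weights
    W_hy = np.random.randn(O, H) * 0.1      # hidden-to-output weights

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def rnn_step(h_prev, x):
        h = np.tanh(W_xh @ x + W_hh @ h_prev)  # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        y = sigmoid(W_hy @ h)                  # y_t = sigmoid(W_hy h_t)
        return h, y

    h = np.zeros(H)                            # h_0
    for x in np.random.randn(3, D):            # x_1, x_2, x_3
        h, y = rnn_step(h, x)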

Page 9:

Let's focus on the actual process in more detail. I will write out the architecture:

$$u_t = W_{xh} x_t + W_{hh} h_{t-1}$$
$$h_t = \tanh(u_t)$$
$$v_t = W_{hy} h_t$$
$$y_t = \mathrm{sigmoid}(v_t)$$

(Each box in the figure is a function.) See the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$.

Page 10:

[Same figure and equations as the previous slide, now grouped into a single box: the RNN as one large function with input $(x_t, h_{t-1})$ and output $(y_t, h_t)$.]

Page 11:

[Figure: the boxed RNN function from the previous slide, drawn next to a block simply labeled LSTM.]

A-san: Oh, this looks the same as the LSTM figure!

Page 12:

Summary of this section

A-san: So the LSTM figure is not special!
Neural bear: Yeah. Moreover, the initial hidden state $h_0$ is often omitted, as below.

[Figure: three chained RNN blocks over inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3, with h_0 left out.]

If you view the RNN blocks as LSTM blocks, you actually also need to pass the cell value to the LSTM module at the next time step, but that is mostly omitted in figures, too.

Page 13:

By the way, if you want to see the contents of the LSTM…

[Figure: the LSTM block internals, with the cell state flowing from c_{t-1} to c_t.]

$$z_t = \tanh(W_{xz} x_t + W_{hz} h_{t-1})$$
$$g_{i,t} = \sigma(W_{xi} x_t + W_{hi} h_{t-1}), \qquad \hat{z}_t = z_t \odot g_{i,t}$$
$$g_{f,t} = \sigma(W_{xf} x_t + W_{hf} h_{t-1}), \qquad \hat{c}_{t-1} = c_{t-1} \odot g_{f,t}$$
$$c_t = \hat{c}_{t-1} + \hat{z}_t$$
$$g_{o,t} = \sigma(W_{xo} x_t + W_{ho} h_{t-1}), \qquad h_t = \tanh(c_t) \odot g_{o,t}$$
$$y_t = \sigma(W_{hy} h_t)$$

(here $\sigma(\cdot)$ denotes $\mathrm{sigmoid}(\cdot)$)
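A minimal NumPy sketch of one step of these equations (the weight container W and the absence of bias terms are simplifying assumptions):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(c_prev, h_prev, x, W):
        # W is a dict of weight matrices: W['xz'], W['hz'], W['xi'], ...
        z   = np.tanh(W['xz'] @ x + W['hz'] @ h_prev)   # candidate input z_t
        g_i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev)   # input gate
        g_f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev)   # forget gate
        g_o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev)   # output gate
        c = c_prev * g_f + z * g_i                      # cell (CEC) update
        h = np.tanh(c) * g_o                            # gated hidden state
        return c, h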

Page 14:

LSTM FAQ

Q. What is the difference between RNN and LSTM?
A. The Constant Error Carousel (CEC, often called the cell) and the input gate, forget gate, and output gate.

• Input gate: selects whether to accept the input into the cell or not
• Forget gate: selects whether to throw away the cell information or not
• Output gate: selects how much information to pass on to the next time step

Q. Why does the LSTM avoid the vanishing gradient problem?
A.
1. Backpropagation suffers because the sigmoid derivative is multiplied in repeatedly.
2. The RNN output is affected by the constantly changing hidden states.
3. The LSTM has a cell and stores previous inputs as a sum of weighted inputs, so it is robust to the current hidden states (of course, there is a limit to how long a sequence it can remember).
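To make point 1 concrete: in a simple RNN, the gradient flowing from step $t$ back to step $k$ contains a product of per-step Jacobians (this is the standard derivation, not shown on the slide),

$$\frac{\partial h_t}{\partial h_k} = \prod_{\tau=k+1}^{t} \mathrm{diag}\!\left(1 - h_\tau^2\right) W_{hh}$$

and this product shrinks (or explodes) geometrically as $t - k$ grows. The cell's additive update $c_t = \hat{c}_{t-1} + \hat{z}_t$ avoids this repeated multiplication.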

Page 15:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ( http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

[Figure: LSTM block diagram.]

Page 16:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ( http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

LSTM with peephole connections. This is known as the standard LSTM, but the variant with the peepholes omitted is often used, too.

Page 17:

Chainer usage

Without peepholes (the standard version in Chainer): chainer.links.LSTM
With peepholes: chainer.links.StatefulPeepholeLSTM

Stateless○○:

    h = init_state()
    h = stateless_lstm(h, x1)
    h = stateless_lstm(h, x2)

Stateful○○:

    stateful_lstm(x1)
    stateful_lstm(x2)

"Stateful" means the hidden state is wrapped inside the internal state of the function(※).

(※) https://groups.google.com/forum/#!topic/chainer-jp/bJ9IQWtsef4
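For example, usage of the stateful link looks like this (a minimal sketch against Chainer's 2016-era API; the sizes and data are made up):

    import numpy as np
    import chainer
    import chainer.links as L

    lstm = L.LSTM(10, 20)          # stateful: keeps (c, h) inside the link
    lstm.reset_state()             # clear the state before a new sequence
    x1 = chainer.Variable(np.random.randn(1, 10).astype(np.float32))
    x2 = chainer.Variable(np.random.randn(1, 10).astype(np.float32))
    h1 = lstm(x1)                  # the state is updated inside the link
    h2 = lstm(x2)                  # no need to pass (c, h) around yourself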

Page 18:

2. LSTM Learning Methods

Full BPTT / Truncated BPTT
(BPTT: Back-Propagation Through Time)

Graham Neubig, NLP tutorial 8: recurrent neural networks
http://www.phontron.com/slides/nlp-programming-ja-08-rnn.pdf

Page 19:

Truncated BPTT in Chainer

Figure cited from "Chainerの使い方と自然言語処理への応用" ("How to use Chainer and its applications to natural language processing", http://www.slideshare.net/beam2d/chainer-52369222 )

Page 20:

Truncated BPTT in Chainer

[Figure: an LSTM chain unrolled over x_1, x_2, …, x_30 with outputs y_1 through y_30 and hidden states h_1 through h_30; backpropagation runs back through the last 30 steps ("BP until i = 30"), then the weights are updated.]

A minimal sketch of this training loop is shown below.
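In Chainer, the truncation is done with Variable.unchain_backward(), which cuts the computational graph at the current variable. A hedged sketch (the model, the data names, and the 30-step window are assumptions following the figure):

    import chainer.optimizers as O

    optimizer = O.SGD()
    optimizer.setup(model)              # model: hypothetical recurrent chain

    loss = 0
    for t, (x, target) in enumerate(zip(xs, ts)):   # xs, ts: assumed data
        loss += model(x, target)        # accumulate the loss step by step
        if (t + 1) % 30 == 0:           # truncate every 30 steps
            model.cleargrads()          # (zerograds() in older Chainer)
            loss.backward()             # BP through the last 30 steps only
            loss.unchain_backward()     # cut the graph here: the truncation
            optimizer.update()
            loss = 0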

Page 21:

Mini-batch calculation with GPU

What should I do if I want to use the GPU with sequences of unequal length? Filling after the end of each sequence is standard.

Example (the end-of-sequence symbol is 0):

1 2 0
1 3 3 2 0
1 4 2 0

becomes

1 2 0 0 0
1 3 3 2 0
1 4 2 0 0

This is called zero padding.
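In code, the padding is just filling a fixed-size array (a plain NumPy sketch of the slide's example):

    import numpy as np

    seqs = [[1, 2, 0], [1, 3, 3, 2, 0], [1, 4, 2, 0]]
    T = max(len(s) for s in seqs)
    batch = np.zeros((len(seqs), T), dtype=np.int32)   # 0 = end of sequence
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    # batch now equals the zero-padded matrix shown above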

Page 22:

Mini-batch calculation with GPU

The learned model becomes redundant: it has to learn a "keep outputting 0" rule for the padded positions. Adding a handcrafted rule can avoid this. There are two methods in Chainer:

• chainer.functions.where
• NStepLSTM (v1.16.0 or later)

Page 23:

chainer.functions.where

[Figure: the zero-padded batch from before flows through the LSTM; the condition matrix S has one row per sequence, True while the sequence is still running and False after it has ended. For the step shown, the rows of S are (False, False, …, False), (True, True, …, True), (False, False, …, False).]

    c_t = F.where(S, c_tmp, c_{t-1})
    h_t = F.where(S, h_tmp, h_{t-1})

Here (c_tmp, h_tmp) is the LSTM output for this step: rows of S that are True take the new state, and rows that are False keep the previous state $(c_{t-1}, h_{t-1})$.
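A sketch of the per-step gating (the shapes, the alive mask, and the surrounding LSTM step are assumptions for illustration):

    import numpy as np
    import chainer.functions as F

    c_tmp, h_tmp = lstm_step(c_prev, h_prev, x_t)        # hypothetical step
    # alive: (batch,) bool array, True where the sequence has not ended yet
    cond = np.broadcast_to(alive[:, None], h_tmp.shape)  # (batch, hidden)
    c_t = F.where(cond, c_tmp, c_prev)   # ended rows keep the old cell
    h_t = F.where(cond, h_tmp, h_prev)   # ended rows keep the old hidden state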

Page 24:

NStepLSTM (v1.16.0 or later)

NStepLSTM can do the filling automatically.

There is a bug with cuDNN dropout(※); the fixed version was merged into the master repository on 10/25. Use the latest version (wait for v1.18.0, or git clone from GitHub):
https://github.com/pfnet/chainer/pull/1804

There is no documentation yet, so read the raw script:
https://github.com/pfnet/chainer/blob/master/chainer/functions/connection/n_step_lstm.py

(※) "ChainerのNStepLSTMでニコニコ動画のコメント予測。" (comment prediction for Nico Nico Douga with Chainer's NStepLSTM): http://www.monthly-hack.com/entry/2016/10/24/200000

A-san: So I didn't need to listen to the F.where part?
Neural bear: Hahaha…

Page 25:

Gradient clipping can suppress gradient explosion

The LSTM can address the vanishing gradient problem, but RNNs also suffer from exploding gradients(※).

※ On the difficulty of training recurrent neural networks
http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf

Proposed in ※: if the norm of the overall gradient exceeds a threshold, rescale the gradient so that its norm equals the threshold.

In Chainer, you can use:

    optimizer.add_hook(chainer.optimizer.GradientClipping(threshold))
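Written out, the clipping rule from the paper is

$$g \leftarrow \frac{\text{threshold}}{\lVert g \rVert}\, g \quad \text{if} \quad \lVert g \rVert > \text{threshold}$$

where $g$ is the concatenation of all parameter gradients.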

Page 26:

Dropout applied to LSTM

Dropout is a strong smoothing (regularization) method, but applying dropout just anywhere does not always succeed.

※ Recurrent Dropout without Memory Loss
https://arxiv.org/abs/1603.05118

According to ※, comparing
1. dropout on the recurrent hidden state in the LSTM,
2. dropout on the cell in the LSTM,
3. dropout on the input gate in the LSTM,
option 3 achieved the best performance.

Basically: dropout should not be applied to the recurrent part; it should be applied to the forward part. A sketch of the common pattern is shown below.
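For instance, the usual safe pattern looks like this (a hedged sketch with assumed model code, not from the slides):

    import chainer.functions as F

    def step(lstm, embed, x):
        e = F.dropout(embed(x), ratio=0.5)  # forward (input-side) connection: dropout OK
        h = lstm(e)                         # recurrent connection: no dropout
        return h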

Page 27:

Batch Normalization on LSTM

Batch Normalization? Scaling the distribution of the activations (the sums of weighted inputs) to N(0,1).
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf

In theory, BN should be computed over all the data; in practice, BN is applied per mini-batch.

[Figure: batch normalization applied to the activation x.]
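For reference, the per-mini-batch transform from the paper ($\mu_B$, $\sigma_B^2$ are the mini-batch mean and variance; $\gamma$, $\beta$ are learned):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$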

Page 28:

Batch Normalization on LSTM

Applying BN to an RNN does not improve performance(※):
• hidden-to-hidden: suffers from gradient explosion due to the repeated scaling
• input-to-hidden: makes learning faster, but does not improve performance

※ Batch Normalized Recurrent Neural Networks
https://arxiv.org/abs/1510.01378

Three newly proposed approaches (in order of proposal):
• (Weight Normalization) https://arxiv.org/abs/1602.07868
• (Recurrent Batch Normalization) https://arxiv.org/abs/1603.09025
• Layer Normalization https://arxiv.org/abs/1607.06450

Page 29:

Difference between Batch Normalization and Layer Normalization

Assume activations $a_i^{(n)} = \sum_j w_{ij} x_j^{(n)}$ with $h_i^{(n)} = f(a_i^{(n)})$, arranged as an N × H matrix (rows: the samples in the mini-batch; columns: the hidden units $a_1, \dots, a_H$).

• Batch Normalization normalizes vertically (each unit across the mini-batch).
• Layer Normalization normalizes horizontally (each sample across its own units).

The variance $\sigma$ becomes larger when gradients explode; normalization makes the output more robust (details are in the paper).
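As a concrete reference, the per-sample statistics from the Layer Normalization paper are

$$\mu^{(n)} = \frac{1}{H} \sum_{i=1}^{H} a_i^{(n)}, \qquad \sigma^{(n)} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^{(n)} - \mu^{(n)}\right)^2}$$

whereas Batch Normalization computes the analogous statistics per unit $i$ over the N samples of the mini-batch.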

Page 30:

Initialization Tips

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
https://arxiv.org/abs/1312.6120v3

A Simple Way to Initialize Recurrent Networks of Rectified Linear Units
https://arxiv.org/abs/1504.00941v2

An RNN with ReLU whose recurrent weight matrix is initialized to the identity is as good as an LSTM.

Page 31:

[Figure from "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units".]

Page 32:

MNIST 784-step (pixel-by-pixel) sequence prediction

Page 33:

Simple RNN:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{sigmoid}(W_{hy} h_t)$$

IRNN:

$$h_t = \mathrm{ReLU}(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{ReLU}(W_{hy} h_t)$$

with $W_{hh}$ initialized to the identity matrix. When $x = 0$, $h_t = \mathrm{ReLU}(h_{t-1})$: with identity initialization the hidden state is simply carried over.
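A minimal NumPy sketch of the IRNN initialization (the sizes and the scale of the input weights are assumptions):

    import numpy as np

    D, H = 28, 100                        # assumed input and hidden sizes
    W_xh = 0.001 * np.random.randn(H, D)  # small random input weights
    W_hh = np.eye(H)                      # identity initialization: the IRNN trick

    def irnn_step(h_prev, x):
        return np.maximum(0.0, W_xh @ x + W_hh @ h_prev)  # ReLU recurrence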

Page 34:

Extra materials

Page 35:

Various RNN models

• Encoder-Decoder
• Bidirectional LSTM
• Attention model

Page 36:

Focus on the initial value of the RNN hidden layer

[Figure: an unrolled RNN with initial hidden state h_0; MNIST digits generated slice by slice, comparing the original, generation from a learned h_0, and generation from a random h_0. The first slice is 0 (black), yet various different sequences appear.]

The RNN output changes with the initial hidden state $h_0$. $h_0$ is also learnable by backpropagation, and it can be connected to an encoder output → the encoder-decoder model.

Page 37:

Encoder-Decoder model

[Figure: an encoder RNN reads x_1^enc, x_2^enc, x_3^enc; its final state becomes h_0^dec, the initial state of a decoder RNN that outputs y_1, y_2, y_3.]

Point: use this when your input and output data have different sequence lengths. $h_0^{dec}$ is learned by training the encoder and the decoder at the same time. To improve performance, you can use beam search in the decoder.

Page 38:

Bidirectional LSTM

[Figure: two encoders feed the decoder, one reading x_1^enc → x_3^enc ("I remember the latter information!") and one reading x_3^enc → x_1^enc ("I remember the former information!").]

Very long time dependencies are difficult to learn unless you use an LSTM, and even the LSTM does not fundamentally solve gradient vanishing. You can improve performance by adding an encoder that reads the input in the inverted order.

Page 39:

Attention model

[Figure: the encoder-decoder with attention; the intermediate encoder hidden states h_1^enc, h_2^enc, h_3^enc are weighted by α_{1,t}, α_{2,t}, α_{3,t} at each decoder step and fed to the decoder.]

Moreover, using the intermediate hidden states of the encoder leads to better performance!
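For reference, the usual form of this weighting (the standard attention formulation, not spelled out on the slide) is

$$c_t = \sum_i \alpha_{i,t}\, h_i^{enc}, \qquad \alpha_{i,t} = \frac{\exp\left(\mathrm{score}(h_t^{dec}, h_i^{enc})\right)}{\sum_j \exp\left(\mathrm{score}(h_t^{dec}, h_j^{enc})\right)}$$

where the context vector $c_t$ is fed to the decoder at step $t$.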

Page 40:

Gated Recurrent Unit (GRU)

A variant of the LSTM:
• the cell is deleted
• the gates are reduced to 2

Despite the lower complexity, the performance is not bad. It often appears in MT tasks and SD tasks.

[Figure: a GRU block next to an LSTM block.]
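For reference, the standard GRU equations (from Cho et al.; not written on the slide), in the same notation as before:

$$g_{z,t} = \sigma(W_{xz} x_t + W_{hz} h_{t-1}) \quad \text{(update gate)}$$
$$g_{r,t} = \sigma(W_{xr} x_t + W_{hr} h_{t-1}) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh\left(W_{xh} x_t + W_{hh} (g_{r,t} \odot h_{t-1})\right)$$
$$h_t = (1 - g_{z,t}) \odot h_{t-1} + g_{z,t} \odot \tilde{h}_t$$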

Page 41:

GRU can be interpreted as a special case of the LSTM. Try to split the LSTM and turn it upside down:

1. The GRU's hidden state plays the role that the cell plays in the LSTM.
2. The input gate and the output gate are shared as the update gate.
3. The tanh on the LSTM's cell output is deleted.

[Figure: GRU and LSTM diagrams side by side.]

Page 42:

GRU can be interpreted as a special case of the LSTM.

1. Try to split the LSTM and turn it upside down.

[Figure: the LSTM being split.]

Page 43:

GRU can be interpreted as a special case of the LSTM:

1. Try to split the LSTM and turn it upside down.
2. See the LSTM's cell as the GRU's hidden state.
3. Share the input gate and the output gate as the update gate.
4. Delete the tanh on the LSTM's cell output.

[Figure: GRU and LSTM diagrams side by side.]