SD Study: RNN & LSTM
2016/11/10, Seitaro Shinagawa
Slides: isw3.naist.jp/.../student/2015/seitaro-s/161110SDstudy.pdf

Page 1:

SD Study: RNN & LSTM
2016/11/10
Seitaro Shinagawa

Page 2:

This presentation is intended for people who already understand simple neural-network architectures such as feed-forward networks.

Page 3:

I will introduce the LSTM: how to use it, and some tips for Chainer.

Page 4:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ("Understanding LSTM, together with recent trends", http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

1. RNN to LSTM

[Figure: a simple RNN, drawn as input layer, middle (hidden) layer, and output layer.]

Page 5:

FAQ with LSTM beginner students

A-san: I hear the LSTM is a kind of RNN, but the LSTM looks like a different architecture…
Neural bear: They have the same architecture! Please follow me!

[Figure: the RNN A-san often sees (a single block with input x_t, hidden state h_t, output y_t) next to the LSTM A-san often sees (three chained LSTM blocks with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3). Are these the same or different?]

Page 6:

Introducing the LSTM figure, starting from the RNN

[Figure: the single RNN block with input x_t, hidden state h_t, and output y_t.]

Page 7:

Introducing the LSTM figure, starting from the RNN

Unroll it along the time axis.

[Figure: the same RNN block, about to be unrolled over time.]

Page 8:

Introducing the LSTM figure, starting from the RNN

Unrolled along the time axis:

[Figure: the unrolled RNN; h_0 feeds h_1, h_2, h_3 with inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3.]

A-san: Oh, I often see this for the RNN!

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{sigmoid}(W_{hy} h_t)$$

So this figure focuses on the variables and shows the relationships between them.
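As a concrete reference, here is a minimal NumPy sketch of this recurrence (the sizes, initialization, and variable names are assumptions for illustration, not from the slides):

    import numpy as np

    D, H, O = 8, 16, 4                      # assumed input/hidden/output sizes
    W_xh = np.random.randn(H, D) * 0.1      # input-to-hidden weights
    W_hh = np.random.randn(H, H) * 0.1      # hidden-to-hidden weights
    W_hy = np.random.randn(O, H) * 0.1      # hidden-to-output weights

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def rnn_step(h_prev, x):
        h = np.tanh(W_xh @ x + W_hh @ h_prev)  # h_t = tanh(W_xh x_t + W_hh h_{t-1})
        y = sigmoid(W_hy @ h)                  # y_t = sigmoid(W_hy h_t)
        return h, y

    h = np.zeros(H)                            # h_0
    for x in np.random.randn(3, D):            # x_1, x_2, x_3
        h, y = rnn_step(h, x)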

Page 9:

Let's focus on the actual process in more detail. I will write out the architecture:

$$u_t = W_{xh} x_t + W_{hh} h_{t-1}$$
$$h_t = \tanh(u_t)$$
$$v_t = W_{hy} h_t$$
$$y_t = \mathrm{sigmoid}(v_t)$$

(Each box in the figure is a function.) See the RNN as one large function that takes $(x_t, h_{t-1})$ as input and returns $(y_t, h_t)$.

Page 10:

[Same figure and equations as the previous slide, now grouped into a single box: the RNN as one large function with input $(x_t, h_{t-1})$ and output $(y_t, h_t)$.]

Page 11:

[Figure: the boxed RNN function from the previous slide, drawn next to a block simply labeled LSTM.]

A-san: Oh, this looks the same as the LSTM figure!

Page 12:

Summary of this section

A-san: So the LSTM figure is not special!
Neural bear: Yeah. Moreover, the initial hidden state $h_0$ is often omitted, as below.

[Figure: three chained RNN blocks over inputs x_1, x_2, x_3 and outputs y_1, y_2, y_3, with h_0 left out.]

If you view the RNN blocks as LSTM blocks, you actually also need to pass the cell value to the LSTM module at the next time step, but that is mostly omitted in figures, too.

Page 13:

By the way, if you want to see the contents of the LSTM…

[Figure: the LSTM block internals, with the cell state flowing from c_{t-1} to c_t.]

$$z_t = \tanh(W_{xz} x_t + W_{hz} h_{t-1})$$
$$g_{i,t} = \sigma(W_{xi} x_t + W_{hi} h_{t-1}), \qquad \hat{z}_t = z_t \odot g_{i,t}$$
$$g_{f,t} = \sigma(W_{xf} x_t + W_{hf} h_{t-1}), \qquad \hat{c}_{t-1} = c_{t-1} \odot g_{f,t}$$
$$c_t = \hat{c}_{t-1} + \hat{z}_t$$
$$g_{o,t} = \sigma(W_{xo} x_t + W_{ho} h_{t-1}), \qquad h_t = \tanh(c_t) \odot g_{o,t}$$
$$y_t = \sigma(W_{hy} h_t)$$

(here $\sigma(\cdot)$ denotes $\mathrm{sigmoid}(\cdot)$)
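A minimal NumPy sketch of one step of these equations (the weight container W and the absence of bias terms are simplifying assumptions):

    import numpy as np

    def sigmoid(v):
        return 1.0 / (1.0 + np.exp(-v))

    def lstm_step(c_prev, h_prev, x, W):
        # W is a dict of weight matrices: W['xz'], W['hz'], W['xi'], ...
        z   = np.tanh(W['xz'] @ x + W['hz'] @ h_prev)   # candidate input z_t
        g_i = sigmoid(W['xi'] @ x + W['hi'] @ h_prev)   # input gate
        g_f = sigmoid(W['xf'] @ x + W['hf'] @ h_prev)   # forget gate
        g_o = sigmoid(W['xo'] @ x + W['ho'] @ h_prev)   # output gate
        c = c_prev * g_f + z * g_i                      # cell (CEC) update
        h = np.tanh(c) * g_o                            # gated hidden state
        return c, h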

Page 14:

LSTM FAQ

Q. What is the difference between RNN and LSTM?
A. The Constant Error Carousel (CEC, often called the cell) and the input gate, forget gate, and output gate.

• Input gate: selects whether to accept the input into the cell or not
• Forget gate: selects whether to throw away the cell information or not
• Output gate: selects how much information to pass on to the next time step

Q. Why does the LSTM avoid the vanishing gradient problem?
A.
1. Backpropagation suffers because the sigmoid derivative is multiplied in repeatedly.
2. The RNN output is affected by the constantly changing hidden states.
3. The LSTM has a cell and stores previous inputs as a sum of weighted inputs, so it is robust to the current hidden states (of course, there is a limit to how long a sequence it can remember).
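To make point 1 concrete: in a simple RNN, the gradient flowing from step $t$ back to step $k$ contains a product of per-step Jacobians (this is the standard derivation, not shown on the slide),

$$\frac{\partial h_t}{\partial h_k} = \prod_{\tau=k+1}^{t} \mathrm{diag}\!\left(1 - h_\tau^2\right) W_{hh}$$

and this product shrinks (or explodes) geometrically as $t - k$ grows. The cell's additive update $c_t = \hat{c}_{t-1} + \hat{z}_t$ avoids this repeated multiplication.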

Page 15:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ( http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

[Figure: LSTM block diagram.]

Page 16:

Figure cited from "わかるLSTM ~ 最近の動向と共に" ( http://qiita.com/t_Signull/items/21b82be280b46f467d1b )

LSTM with peephole connections. This is known as the standard LSTM, but the variant with the peepholes omitted is often used, too.

Page 17:

Chainer usage

Without peepholes (the standard version in Chainer): chainer.links.LSTM
With peepholes: chainer.links.StatefulPeepholeLSTM

Stateless○○:

    h = init_state()
    h = stateless_lstm(h, x1)
    h = stateless_lstm(h, x2)

Stateful○○:

    stateful_lstm(x1)
    stateful_lstm(x2)

"Stateful" means the hidden state is wrapped inside the internal state of the function(※).

(※) https://groups.google.com/forum/#!topic/chainer-jp/bJ9IQWtsef4
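For example, usage of the stateful link looks like this (a minimal sketch against Chainer's 2016-era API; the sizes and data are made up):

    import numpy as np
    import chainer
    import chainer.links as L

    lstm = L.LSTM(10, 20)          # stateful: keeps (c, h) inside the link
    lstm.reset_state()             # clear the state before a new sequence
    x1 = chainer.Variable(np.random.randn(1, 10).astype(np.float32))
    x2 = chainer.Variable(np.random.randn(1, 10).astype(np.float32))
    h1 = lstm(x1)                  # the state is updated inside the link
    h2 = lstm(x2)                  # no need to pass (c, h) around yourself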

Page 18:

2. LSTM Learning Methods

Full BPTT / Truncated BPTT
(BPTT: Back-Propagation Through Time)

Graham Neubig, NLP tutorial 8: recurrent neural networks
http://www.phontron.com/slides/nlp-programming-ja-08-rnn.pdf

Page 19:

Truncated BPTT in Chainer

Figure cited from "Chainerの使い方と自然言語処理への応用" ("How to use Chainer and its applications to natural language processing", http://www.slideshare.net/beam2d/chainer-52369222 )

Page 20:

Truncated BPTT in Chainer

[Figure: an LSTM chain unrolled over x_1, x_2, …, x_30 with outputs y_1 through y_30 and hidden states h_1 through h_30; backpropagation runs back through the last 30 steps ("BP until i = 30"), then the weights are updated.]

A minimal sketch of this training loop is shown below.
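In Chainer, the truncation is done with Variable.unchain_backward(), which cuts the computational graph at the current variable. A hedged sketch (the model, the data names, and the 30-step window are assumptions following the figure):

    import chainer.optimizers as O

    optimizer = O.SGD()
    optimizer.setup(model)              # model: hypothetical recurrent chain

    loss = 0
    for t, (x, target) in enumerate(zip(xs, ts)):   # xs, ts: assumed data
        loss += model(x, target)        # accumulate the loss step by step
        if (t + 1) % 30 == 0:           # truncate every 30 steps
            model.cleargrads()          # (zerograds() in older Chainer)
            loss.backward()             # BP through the last 30 steps only
            loss.unchain_backward()     # cut the graph here: the truncation
            optimizer.update()
            loss = 0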

Page 21:

Mini-batch calculation with GPU

What should I do if I want to use the GPU with sequences of unequal length? Filling after the end of each sequence is standard.

Example (the end-of-sequence symbol is 0):

1 2 0
1 3 3 2 0
1 4 2 0

becomes

1 2 0 0 0
1 3 3 2 0
1 4 2 0 0

This is called zero padding.
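In code, the padding is just filling a fixed-size array (a plain NumPy sketch of the slide's example):

    import numpy as np

    seqs = [[1, 2, 0], [1, 3, 3, 2, 0], [1, 4, 2, 0]]
    T = max(len(s) for s in seqs)
    batch = np.zeros((len(seqs), T), dtype=np.int32)   # 0 = end of sequence
    for i, s in enumerate(seqs):
        batch[i, :len(s)] = s
    # batch now equals the zero-padded matrix shown above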

Page 22:

Mini-batch calculation with GPU

The learned model becomes redundant: it has to learn a "keep outputting 0" rule for the padded positions. Adding a handcrafted rule can avoid this. There are two methods in Chainer:

• chainer.functions.where
• NStepLSTM (v1.16.0 or later)

Page 23:

chainer.functions.where

[Figure: the zero-padded batch from before flows through the LSTM; the condition matrix S has one row per sequence, True while the sequence is still running and False after it has ended. For the step shown, the rows of S are (False, False, …, False), (True, True, …, True), (False, False, …, False).]

    c_t = F.where(S, c_tmp, c_{t-1})
    h_t = F.where(S, h_tmp, h_{t-1})

Here (c_tmp, h_tmp) is the LSTM output for this step: rows of S that are True take the new state, and rows that are False keep the previous state $(c_{t-1}, h_{t-1})$.
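A sketch of the per-step gating (the shapes, the alive mask, and the surrounding LSTM step are assumptions for illustration):

    import numpy as np
    import chainer.functions as F

    c_tmp, h_tmp = lstm_step(c_prev, h_prev, x_t)        # hypothetical step
    # alive: (batch,) bool array, True where the sequence has not ended yet
    cond = np.broadcast_to(alive[:, None], h_tmp.shape)  # (batch, hidden)
    c_t = F.where(cond, c_tmp, c_prev)   # ended rows keep the old cell
    h_t = F.where(cond, h_tmp, h_prev)   # ended rows keep the old hidden state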

Page 24:

NStepLSTM (v1.16.0 or later)

NStepLSTM can do the filling automatically.

There is a bug with cuDNN dropout(※); the fixed version was merged into the master repository on 10/25. Use the latest version (wait for v1.18.0, or git clone from GitHub):
https://github.com/pfnet/chainer/pull/1804

There is no documentation yet, so read the raw script:
https://github.com/pfnet/chainer/blob/master/chainer/functions/connection/n_step_lstm.py

(※) "ChainerのNStepLSTMでニコニコ動画のコメント予測。" (comment prediction for Nico Nico Douga with Chainer's NStepLSTM): http://www.monthly-hack.com/entry/2016/10/24/200000

A-san: So I didn't need to listen to the F.where part?
Neural bear: Hahaha…

Page 25:

Gradient clipping can suppress gradient explosion

The LSTM can address the vanishing gradient problem, but RNNs also suffer from exploding gradients(※).

※ On the difficulty of training recurrent neural networks
http://www.jmlr.org/proceedings/papers/v28/pascanu13.pdf

Proposed in ※: if the norm of the overall gradient exceeds a threshold, rescale the gradient so that its norm equals the threshold.

In Chainer, you can use:

    optimizer.add_hook(chainer.optimizer.GradientClipping(threshold))
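Written out, the clipping rule from the paper is

$$g \leftarrow \frac{\text{threshold}}{\lVert g \rVert}\, g \quad \text{if} \quad \lVert g \rVert > \text{threshold}$$

where $g$ is the concatenation of all parameter gradients.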

Page 26:

Dropout applied to LSTM

Dropout is a strong smoothing (regularization) method, but applying dropout just anywhere does not always succeed.

※ Recurrent Dropout without Memory Loss
https://arxiv.org/abs/1603.05118

According to ※, comparing
1. dropout on the recurrent hidden state in the LSTM,
2. dropout on the cell in the LSTM,
3. dropout on the input gate in the LSTM,
option 3 achieved the best performance.

Basically: dropout should not be applied to the recurrent part; it should be applied to the forward part. A sketch of the common pattern is shown below.
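For instance, the usual safe pattern looks like this (a hedged sketch with assumed model code, not from the slides):

    import chainer.functions as F

    def step(lstm, embed, x):
        e = F.dropout(embed(x), ratio=0.5)  # forward (input-side) connection: dropout OK
        h = lstm(e)                         # recurrent connection: no dropout
        return h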

Page 27:

Batch Normalization on LSTM

Batch Normalization? Scaling the distribution of the activations (the sums of weighted inputs) to N(0,1).
http://jmlr.org/proceedings/papers/v37/ioffe15.pdf

In theory, BN should be computed over all the data; in practice, BN is applied per mini-batch.

[Figure: batch normalization applied to the activation x.]
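For reference, the per-mini-batch transform from the paper ($\mu_B$, $\sigma_B^2$ are the mini-batch mean and variance; $\gamma$, $\beta$ are learned):

$$\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}, \qquad y_i = \gamma \hat{x}_i + \beta$$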

Page 28:

Batch Normalization on LSTM

Applying BN to an RNN does not improve performance(※):
• hidden-to-hidden: suffers from gradient explosion due to the repeated scaling
• input-to-hidden: makes learning faster, but does not improve performance

※ Batch Normalized Recurrent Neural Networks
https://arxiv.org/abs/1510.01378

Three newly proposed approaches (in order of proposal):
• (Weight Normalization) https://arxiv.org/abs/1602.07868
• (Recurrent Batch Normalization) https://arxiv.org/abs/1603.09025
• Layer Normalization https://arxiv.org/abs/1607.06450

Page 29:

Difference between Batch Normalization and Layer Normalization

Assume activations $a_i^{(n)} = \sum_j w_{ij} x_j^{(n)}$ with $h_i^{(n)} = f(a_i^{(n)})$, arranged as an N × H matrix (rows: the samples in the mini-batch; columns: the hidden units $a_1, \dots, a_H$).

• Batch Normalization normalizes vertically (each unit across the mini-batch).
• Layer Normalization normalizes horizontally (each sample across its own units).

The variance $\sigma$ becomes larger when gradients explode; normalization makes the output more robust (details are in the paper).
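As a concrete reference, the per-sample statistics from the Layer Normalization paper are

$$\mu^{(n)} = \frac{1}{H} \sum_{i=1}^{H} a_i^{(n)}, \qquad \sigma^{(n)} = \sqrt{\frac{1}{H} \sum_{i=1}^{H} \left(a_i^{(n)} - \mu^{(n)}\right)^2}$$

whereas Batch Normalization computes the analogous statistics per unit $i$ over the N samples of the mini-batch.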

Page 30:

Initialization Tips

Exact solutions to the nonlinear dynamics of learning in deep linear neural networks
https://arxiv.org/abs/1312.6120v3

A Simple Way to Initialize Recurrent Networks of Rectified Linear Units
https://arxiv.org/abs/1504.00941v2

An RNN with ReLU whose recurrent weight matrix is initialized to the identity is as good as an LSTM.

Page 31:

[Figure from "A Simple Way to Initialize Recurrent Networks of Rectified Linear Units".]

Page 32:

MNIST 784-step (pixel-by-pixel) sequence prediction

Page 33:

Simple RNN:

$$h_t = \tanh(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{sigmoid}(W_{hy} h_t)$$

IRNN:

$$h_t = \mathrm{ReLU}(W_{xh} x_t + W_{hh} h_{t-1}), \qquad y_t = \mathrm{ReLU}(W_{hy} h_t)$$

with $W_{hh}$ initialized to the identity matrix. When $x = 0$, $h_t = \mathrm{ReLU}(h_{t-1})$: with identity initialization the hidden state is simply carried over.
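A minimal NumPy sketch of the IRNN initialization (the sizes and the scale of the input weights are assumptions):

    import numpy as np

    D, H = 28, 100                        # assumed input and hidden sizes
    W_xh = 0.001 * np.random.randn(H, D)  # small random input weights
    W_hh = np.eye(H)                      # identity initialization: the IRNN trick

    def irnn_step(h_prev, x):
        return np.maximum(0.0, W_xh @ x + W_hh @ h_prev)  # ReLU recurrence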

Page 34:

Extra materials

Page 35:

Various RNN models

• Encoder-Decoder
• Bidirectional LSTM
• Attention model

Page 36:

Focus on the initial value of the RNN hidden layer

[Figure: an unrolled RNN with initial hidden state h_0; MNIST digits generated slice by slice, comparing the original, generation from a learned h_0, and generation from a random h_0. The first slice is 0 (black), yet various different sequences appear.]

The RNN output changes with the initial hidden state $h_0$. $h_0$ is also learnable by backpropagation, and it can be connected to an encoder output → the encoder-decoder model.

Page 37:

Encoder-Decoder model

[Figure: an encoder RNN reads x_1^enc, x_2^enc, x_3^enc; its final state becomes h_0^dec, the initial state of a decoder RNN that outputs y_1, y_2, y_3.]

Point: use this when your input and output data have different sequence lengths. $h_0^{dec}$ is learned by training the encoder and the decoder at the same time. To improve performance, you can use beam search in the decoder.

Page 38:

Bidirectional LSTM

[Figure: two encoders feed the decoder, one reading x_1^enc → x_3^enc ("I remember the latter information!") and one reading x_3^enc → x_1^enc ("I remember the former information!").]

Very long time dependencies are difficult to learn unless you use an LSTM, and even the LSTM does not fundamentally solve gradient vanishing. You can improve performance by adding an encoder that reads the input in the inverted order.

Page 39:

Attention model

[Figure: the encoder-decoder with attention; the intermediate encoder hidden states h_1^enc, h_2^enc, h_3^enc are weighted by α_{1,t}, α_{2,t}, α_{3,t} at each decoder step and fed to the decoder.]

Moreover, using the intermediate hidden states of the encoder leads to better performance!
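For reference, the usual form of this weighting (the standard attention formulation, not spelled out on the slide) is

$$c_t = \sum_i \alpha_{i,t}\, h_i^{enc}, \qquad \alpha_{i,t} = \frac{\exp\left(\mathrm{score}(h_t^{dec}, h_i^{enc})\right)}{\sum_j \exp\left(\mathrm{score}(h_t^{dec}, h_j^{enc})\right)}$$

where the context vector $c_t$ is fed to the decoder at step $t$.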

Page 40:

Gated Recurrent Unit (GRU)

A variant of the LSTM:
• the cell is deleted
• the gates are reduced to 2

Despite the lower complexity, the performance is not bad. It often appears in MT tasks and SD tasks.

[Figure: a GRU block next to an LSTM block.]
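For reference, the standard GRU equations (from Cho et al.; not written on the slide), in the same notation as before:

$$g_{z,t} = \sigma(W_{xz} x_t + W_{hz} h_{t-1}) \quad \text{(update gate)}$$
$$g_{r,t} = \sigma(W_{xr} x_t + W_{hr} h_{t-1}) \quad \text{(reset gate)}$$
$$\tilde{h}_t = \tanh\left(W_{xh} x_t + W_{hh} (g_{r,t} \odot h_{t-1})\right)$$
$$h_t = (1 - g_{z,t}) \odot h_{t-1} + g_{z,t} \odot \tilde{h}_t$$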

Page 41:

GRU can be interpreted as a special case of the LSTM. Try to split the LSTM and turn it upside down:

1. The GRU's hidden state plays the role that the cell plays in the LSTM.
2. The input gate and the output gate are shared as the update gate.
3. The tanh on the LSTM's cell output is deleted.

[Figure: GRU and LSTM diagrams side by side.]

Page 42:

GRU can be interpreted as a special case of the LSTM.

1. Try to split the LSTM and turn it upside down.

[Figure: the LSTM being split.]

Page 43:

GRU can be interpreted as a special case of the LSTM:

1. Try to split the LSTM and turn it upside down.
2. See the LSTM's cell as the GRU's hidden state.
3. Share the input gate and the output gate as the update gate.
4. Delete the tanh on the LSTM's cell output.

[Figure: GRU and LSTM diagrams side by side.]