
Page 1

Visualizing and Understanding Recurrent Networks
Andrej Karpathy, Justin Johnson, Li Fei-Fei

Stanford University

Presented by Jing (Lucas) Liu

Page 2

Outline

• Introduction
• Review of RNN and its variants
• Character-Level Language Model
• Internal Mechanisms of LSTM
• Error Analysis
• Conclusion

Page 3

Introduction


• RNNs provide exceptional results
• They are especially good for sequence data
• But the source of their performance and their limitations remain rather poorly understood

Page 4

Vanilla RNN -- Review

• A vanilla RNN takes into account both the current input and what it has learned from the inputs it received previously.
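
For reference, the standard vanilla RNN update, written in the same notation as the LSTM and GRU formulas later in this deck (the exact weight names here are illustrative):

$h_t = \tanh(W_h \cdot [h_{t-1}, x_t] + b_h)$
$y_t = W_y \cdot h_t + b_y$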

Page 5

LSTM -- Review

• Intuitions:
  • Forget gate $f_t$: remember this “bit” of information or not
  • Input gate $i_t$: update this “bit” of information or not
  • Output gate $o_t$: output this “bit” of information to “deeper” layers
  • Candidate memory cell $\tilde{C}_t$
  • Memory cell $C_t$: forget that + memorize this


$f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)$
$i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)$
$o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)$
$\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)$
$C_t = f_t \odot C_{t-1} + i_t \odot \tilde{C}_t$
$h_t = o_t \odot \tanh(C_t)$
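
As an illustration only (not the paper's Torch implementation), a minimal NumPy sketch of a single LSTM step following the equations above; the weight shapes and argument layout are assumptions:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x_t, h_prev, c_prev, W_f, W_i, W_o, W_c, b_f, b_i, b_o, b_c):
    z = np.concatenate([h_prev, x_t])    # [h_{t-1}, x_t]
    f_t = sigmoid(W_f @ z + b_f)         # forget gate: keep this "bit" or not
    i_t = sigmoid(W_i @ z + b_i)         # input gate: update this "bit" or not
    o_t = sigmoid(W_o @ z + b_o)         # output gate: expose this "bit" or not
    c_tilde = np.tanh(W_c @ z + b_c)     # candidate memory cell
    c_t = f_t * c_prev + i_t * c_tilde   # forget that + memorize this
    h_t = o_t * np.tanh(c_t)             # hidden state passed to deeper layers
    return h_t, c_t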

Page 6

GRU
• Intuitions:
  • Update gate $z_t$: behaves like the input and forget gates in the LSTM
  • Reset gate $r_t$: determines how relevant the previous information is for the candidate memory
  • Candidate memory content $\tilde{h}_t$: determines what to remove from the previous time steps
• Major differences from the LSTM:
  • A single update gate plays the roles of both the input and forget gates
  • No output gate to control which part of the state is passed along
  • GRU is faster to train


$z_t = \sigma(W_z \cdot [h_{t-1}, x_t])$
$r_t = \sigma(W_r \cdot [h_{t-1}, x_t])$
$\tilde{h}_t = \tanh(W \cdot [r_t \odot h_{t-1}, x_t])$
$h_t = (1 - z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t$
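
Likewise, a minimal NumPy sketch of one GRU step following the equations above (illustrative only; weight shapes are assumptions):

import numpy as np

def sigmoid(x):  # same helper as in the LSTM sketch
    return 1.0 / (1.0 + np.exp(-x))

def gru_step(x_t, h_prev, W_z, W_r, W_h):
    z = np.concatenate([h_prev, x_t])                   # [h_{t-1}, x_t]
    z_t = sigmoid(W_z @ z)                              # update gate
    r_t = sigmoid(W_r @ z)                              # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([r_t * h_prev, x_t]))  # candidate memory
    return (1.0 - z_t) * h_prev + z_t * h_tilde         # interpolate old and new state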

Page 7

Character-Level Language Modeling
• Input: a sequence of characters
• Goal: predict the next character in the sequence
• Two baseline models:
  • n-gram: estimates the conditional probability of the next character given the previous n-1 characters in the sequence
  • n-NN: a fully-connected neural network with one hidden layer and tanh nonlinearities
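
A rough sketch of the n-gram baseline (the add-alpha smoothing used here is an assumption; the paper's exact smoothing scheme may differ):

from collections import defaultdict

def train_char_ngram(text, n):
    # Count how often each character follows each (n-1)-character history.
    counts = defaultdict(lambda: defaultdict(int))
    for i in range(n - 1, len(text)):
        counts[text[i - (n - 1):i]][text[i]] += 1
    return counts

def ngram_prob(counts, history, ch, vocab_size, alpha=1.0):
    # Estimate P(ch | history) from counts, with add-alpha smoothing.
    follow = counts.get(history, {})
    total = sum(follow.values())
    return (follow.get(ch, 0) + alpha) / (total + alpha * vocab_size)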

Page 8

Character-Level Language Modeling with RNN

Vocabulary: [h, e, l, o]

Example training sequence: “hello”
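
A small sketch of how this toy example becomes training pairs (the one-hot encoding shown here is an assumed input representation):

import numpy as np

vocab = ['h', 'e', 'l', 'o']
char_to_ix = {ch: i for i, ch in enumerate(vocab)}

def one_hot(ch):
    v = np.zeros(len(vocab))
    v[char_to_ix[ch]] = 1.0
    return v

seq = "hello"
inputs  = [one_hot(ch) for ch in seq[:-1]]    # inputs:  h, e, l, l
targets = [char_to_ix[ch] for ch in seq[1:]]  # targets: e, l, l, o (next character)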

Page 9

Shakespeare

Page 10

Page 11


Algebraic Geometry (LaTeX)

Page 12

Linux Source Code

Page 13

Performance comparison


• A depth of at least two is beneficial; however, between two and three layers the results are mixed.

Page 14

Baseline language model comparison

The best recurrent network:
• WP dataset: 1.077
• LK dataset: 0.84

Page 15

Internal mechanisms of LSTM

• An LSTM can in principle use its memory cells to remember long-range information and keep track of various attributes of the text it is currently processing
• But how do these cells actually behave?

Page 16

Visualization of interpretable cells

Text color corresponds to tanh(c), where -1 is red and +1 is blue.
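
A hedged sketch of this coloring convention, mapping a cell's tanh activation onto a red-white-blue scale (the exact color map used in the paper may differ):

import numpy as np

def cell_color(c):
    # tanh(c) in [-1, 1]: -1 maps to red, 0 to white, +1 to blue (RGB in [0, 1]).
    v = float(np.tanh(c))
    if v < 0:
        return (1.0, 1.0 + v, 1.0 + v)   # negative activations: shades of red
    return (1.0 - v, 1.0 - v, 1.0)       # positive activations: shades of blue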

Page 17

Page 18

Page 19

Page 20

Page 21

Long-range dependency results comparison


The LSTM performs better when dealing with special characters that require long-range dependencies.

Page 22

Long-range dependency case study: closing brace


• The closing brace (“}”) requires the longest-term reasoning
• The LSTM only slightly outperforms the 20-gram model in the first bin
• The LSTM gains significant boosts at distances up to 60 characters
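
To make the long-range aspect concrete, a small sketch (an assumed setup, not the paper's evaluation code) that measures, for each closing brace, how far back its matching opening brace is; such distances can then be binned as in the comparison above:

def brace_distances(code):
    # Distance (in characters) from each "}" back to its matching "{".
    stack, distances = [], []
    for i, ch in enumerate(code):
        if ch == '{':
            stack.append(i)
        elif ch == '}' and stack:
            distances.append(i - stack.pop())
    return distances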

Page 23

Other errors/limitations

• Hard to predict rare words
• Hard to predict the first character of each word
• If the model fails to predict the first occurrence of a word, it will also fail to predict the second occurrence

Page 24

Conclusion

• RNNs, especially LSTMs, are powerful
• Although a large portion of the cells do not do anything interpretable, about 5% of them turn out to have learned quite interesting patterns
• The analysis illuminates the sources of the remaining limitations

Page 25

Thank you!
