Understanding LSTM Networks

Page 1: Understanding LSTM Networks

Page 2: Recurrent Neural Networks

Page 3: An unrolled recurrent neural network

Page 4: The Problem of Long-Term Dependencies

Page 5: RNN short-term dependencies

[Figure: RNN unrolled over five time steps, the cell A mapping inputs x0 … x4 to outputs h0 … h4]

A language model trying to predict the next word based on the previous ones:

"the clouds are in the sky" — the relevant context ("clouds") appears only a few steps before the word being predicted.

Page 6: RNN long-term dependencies

[Figure: RNN unrolled over many time steps, the cell A mapping inputs x0, x1, x2, …, x(t−1), x(t) to outputs h0, h1, h2, …, h(t−1), h(t)]

A language model trying to predict the next word based on the previous ones:

"I grew up in India… I speak fluent Hindi." — the relevant context ("India") lies many steps before the word being predicted.

Page 7: Standard RNN

Page 8: Backpropagation Through Time (BPTT)

Page 9: RNN forward pass

s_t = \tanh(U x_t + W s_{t-1})

\hat{y}_t = \mathrm{softmax}(V s_t)

E_t(y_t, \hat{y}_t) = -y_t \log \hat{y}_t, \qquad E(y, \hat{y}) = \sum_t E_t(y_t, \hat{y}_t)

[Figure: unrolled RNN with the same weight matrices U (input to hidden), W (hidden to hidden), and V (hidden to output) shared across all time steps]
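As a concrete illustration of the forward pass above, here is a minimal NumPy sketch of the three equations. The dimensions, the random initialization, and the helper name rnn_forward are assumptions made for this example only, not something from the slides.

import numpy as np

def rnn_forward(xs, U, W, V):
    # s_t = tanh(U x_t + W s_{t-1}),  y_hat_t = softmax(V s_t)
    s = np.zeros(W.shape[0])              # initial state s_{-1}
    states, outputs = [], []
    for x_t in xs:
        s = np.tanh(U @ x_t + W @ s)      # hidden state update
        z = V @ s
        y_hat = np.exp(z - z.max())
        y_hat /= y_hat.sum()              # softmax over the output vocabulary
        states.append(s)
        outputs.append(y_hat)
    return states, outputs

# Toy sizes (illustrative): vocabulary of 8 one-hot inputs, 4 hidden units
rng = np.random.default_rng(0)
U = rng.normal(scale=0.5, size=(4, 8))
W = rng.normal(scale=0.5, size=(4, 4))
V = rng.normal(scale=0.5, size=(8, 4))
xs = [np.eye(8)[i] for i in (1, 3, 5)]    # a made-up three-word input sequence
states, outputs = rnn_forward(xs, U, W, V)

targets = (3, 5, 2)                       # made-up next-word indices
E = -sum(np.log(y_hat[t]) for y_hat, t in zip(outputs, targets))   # E = sum_t E_t = -sum_t log y_hat_t[target_t]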

Page 10: Backpropagation Through Time

\frac{\partial E}{\partial W} = \sum_t \frac{\partial E_t}{\partial W}

\frac{\partial E_3}{\partial W} = \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \frac{\partial s_3}{\partial W}

s_3 = \tanh(U x_3 + W s_2)

But s_3 depends on s_2, which in turn depends on W and s_1, and so on, so s_2 cannot be treated as a constant. Taking the full dependence into account:

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}
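To make the chain-rule sum above concrete, the sketch below computes \partial E_3 / \partial W for a tiny RNN by accumulating the k = 3, 2, 1, 0 terms, then checks one entry against a finite-difference estimate. All sizes, data, and the choice E_3 = −log \hat{y}_3[target] are assumptions made just for this check.

import numpy as np

rng = np.random.default_rng(0)
n_in, n_h, n_out = 3, 4, 3
U = rng.normal(scale=0.5, size=(n_h, n_in))
W = rng.normal(scale=0.5, size=(n_h, n_h))
V = rng.normal(scale=0.5, size=(n_out, n_h))
xs = rng.normal(size=(4, n_in))                  # x_0 .. x_3
target = 1                                       # class index for E_3 = -log y_hat_3[target]

def loss_E3(W):
    # Forward pass, keeping every state s_{-1} .. s_3
    s = np.zeros(n_h)
    states = [s]
    for x in xs:
        s = np.tanh(U @ x + W @ s)
        states.append(s)
    z = V @ states[-1]
    y_hat = np.exp(z - z.max())
    y_hat /= y_hat.sum()
    return -np.log(y_hat[target]), states, y_hat

E3, states, y_hat = loss_E3(W)

# BPTT: accumulate the k = 3, 2, 1, 0 terms of dE_3/dW
delta = V.T @ (y_hat - np.eye(n_out)[target])    # dE_3/ds_3
grad_W = np.zeros_like(W)
for k in range(3, -1, -1):
    d_a = (1.0 - states[k + 1] ** 2) * delta     # back through tanh at step k
    grad_W += np.outer(d_a, states[k])           # immediate dependence of s_k on W (via W s_{k-1})
    delta = W.T @ d_a                            # dE_3/ds_{k-1}

# Finite-difference check of a single entry
eps = 1e-6
W_pert = W.copy()
W_pert[0, 1] += eps
print(grad_W[0, 1], (loss_E3(W_pert)[0] - E3) / eps)  # the two values should agree closely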

Page 11: The Vanishing Gradient Problem

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \frac{\partial s_3}{\partial s_k} \frac{\partial s_k}{\partial W}

\frac{\partial E_3}{\partial W} = \sum_{k=0}^{3} \frac{\partial E_3}{\partial \hat{y}_3} \frac{\partial \hat{y}_3}{\partial s_3} \left( \prod_{j=k+1}^{3} \frac{\partial s_j}{\partial s_{j-1}} \right) \frac{\partial s_k}{\partial W}

● The derivative of a vector with respect to a vector is a matrix, called the Jacobian

● The 2-norm of the above Jacobian matrix has an upper bound of 1

● tanh maps all values into the range between −1 and 1, and its derivative is bounded by 1

● With multiple matrix multiplications, the gradient values shrink exponentially

● Gradient contributions from “far away” steps become zero

● Depending on the activation functions and network parameters, gradients could explode instead of vanish
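The shrinking described above is easy to observe numerically. The following sketch (hidden size, weight scale, and number of steps are arbitrary assumptions) multiplies the per-step Jacobians \partial s_j / \partial s_{j-1} = \mathrm{diag}(1 - s_j^2) W and prints the 2-norm of the running product, which decays rapidly for the small weights chosen here.

import numpy as np

rng = np.random.default_rng(1)
n_h = 50
W = rng.normal(scale=0.05, size=(n_h, n_h))      # deliberately small recurrent weights
U = rng.normal(scale=0.05, size=(n_h, n_h))

s = np.zeros(n_h)
product = np.eye(n_h)                            # running product of Jacobians ds_j/ds_{j-1}
for j in range(1, 21):
    x = rng.normal(size=n_h)
    s = np.tanh(U @ x + W @ s)
    jacobian = np.diag(1.0 - s ** 2) @ W         # d tanh(a)/da = 1 - tanh(a)^2, chained through W
    product = jacobian @ product
    if j % 5 == 0:
        print(j, np.linalg.norm(product, 2))     # 2-norm shrinks as steps accumulate

Scaling the weights up instead makes the same product grow, which is the exploding-gradient case mentioned in the last bullet.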

Page 12: Activation function

Page 13: Basic LSTM

Page 14: Unrolling the LSTM through time

Page 15: Constant error carousel

[Figure: LSTM memory cell. The self-recurrent edge to the next time step has its weight fixed at 1 (the constant error carousel); sigmoid gate units (σ) and multiplicative units (Π) connect it to the edge from the previous time step and the current input. Labels: input gate i_t, output gate o_t, candidate input \tilde{C}_t]

C_t = \tilde{C}_t \cdot i_t + C_{t-1}

Cell output: C_t \cdot o_t

The standard RNN update s_t = \tanh(U x_t + W s_{t-1}) is replaced by this memory cell.

Page 16: Input gate

[Figure: the same memory cell; the input gate i_t (a sigmoid unit feeding a multiplicative unit) scales the candidate input \tilde{C}_t before it enters the cell: C_t = \tilde{C}_t \cdot i_t + C_{t-1}; cell output C_t \cdot o_t]

● Use contextual information to decide
● Store the input into memory
● Protect the memory from being overwritten by other, irrelevant inputs

Page 17: Output gate

[Figure: the same memory cell; the output gate o_t (a sigmoid unit feeding a multiplicative unit) scales what leaves the cell: C_t = \tilde{C}_t \cdot i_t + C_{t-1}; cell output C_t \cdot o_t]

● Use contextual information to decide
● Access information in memory
● Block irrelevant information

Page 18: Forget or reset gate

[Figure: the memory cell extended with a forget gate f_t, an additional sigmoid and multiplicative unit on the self-recurrent edge whose weight is fixed at 1]

C_t = \tilde{C}_t \cdot i_t + C_{t-1} \cdot f_t

Cell output: C_t \cdot o_t
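Combining the input, output, and forget gates gives a complete cell update. The sketch below implements one step in the slides' notation (C_t = \tilde{C}_t \cdot i_t + C_{t-1} \cdot f_t, output C_t \cdot o_t); the weight shapes, the sigmoid helper, and the choice to drive every gate from [h_{t-1}, x_t] are assumptions made for illustration, and other formulations pass the cell state through an extra tanh before the output gate.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, C_prev, params):
    # One LSTM step: every gate is a sigmoid of [h_{t-1}, x_t]; the cell updates additively.
    Wf, Wi, Wc, Wo, bf, bi, bc, bo = params
    z = np.concatenate([h_prev, x_t])
    f_t = sigmoid(Wf @ z + bf)             # forget / reset gate
    i_t = sigmoid(Wi @ z + bi)             # input gate
    C_tilde = np.tanh(Wc @ z + bc)         # candidate cell input ~C_t
    C_t = C_tilde * i_t + C_prev * f_t     # C_t = ~C_t * i_t + C_{t-1} * f_t
    o_t = sigmoid(Wo @ z + bo)             # output gate
    h_t = C_t * o_t                        # cell output as written on the slides
    return h_t, C_t

# Toy sizes (illustrative): 3-dim input, 4-dim hidden/cell state
rng = np.random.default_rng(0)
n_in, n_h = 3, 4
params = [rng.normal(scale=0.1, size=(n_h, n_h + n_in)) for _ in range(4)] + [np.zeros(n_h)] * 4
h, C = np.zeros(n_h), np.zeros(n_h)
for x in rng.normal(size=(5, n_in)):       # run five time steps
    h, C = lstm_step(x, h, C, params)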

Page 19: LSTM with four interacting layers

Page 20: The cell state

Page 21: Gates

A gate is a sigmoid layer followed by a pointwise multiplication; the sigmoid outputs values between 0 and 1, describing how much of each component to let through.

Page 22: Step-by-Step LSTM Walk Through

Page 23: Forget gate layer
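The slide's figure did not survive extraction; in the notation of the colah.github.io post listed in the References, the forget gate layer looks at h_{t-1} and x_t and outputs a number between 0 and 1 for each entry of the cell state C_{t-1}:

f_t = \sigma(W_f \cdot [h_{t-1}, x_t] + b_f)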

Page 24: Input gate layer
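In the same notation, the input gate layer decides which values to update, and a tanh layer proposes candidate values:

i_t = \sigma(W_i \cdot [h_{t-1}, x_t] + b_i)

\tilde{C}_t = \tanh(W_C \cdot [h_{t-1}, x_t] + b_C)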

Page 25: The current state
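The new cell state then forgets and writes in proportion to those gates:

C_t = f_t \cdot C_{t-1} + i_t \cdot \tilde{C}_t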

Page 26: Output layer
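Finally, the output layer in that notation filters the cell state (note that the simpler cell on the earlier slides wrote the output as C_t \cdot o_t, without the extra tanh):

o_t = \sigma(W_o \cdot [h_{t-1}, x_t] + b_o)

h_t = o_t \cdot \tanh(C_t)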

Page 27: References

● http://colah.github.io/posts/2015-08-Understanding-LSTMs/
● http://www.wildml.com/
● http://nikhilbuduma.com/2015/01/11/a-deep-dive-into-recurrent-neural-networks/
● http://deeplearning.net/tutorial/lstm.html
● https://theclevermachine.files.wordpress.com/2014/09/act-funs.png
● http://blog.terminal.com/demistifying-long-short-term-memory-lstm-recurrent-neural-networks/
● Lipton, Zachary C. and Berkowitz, John, "A Critical Review of Recurrent Neural Networks for Sequence Learning"
● Hochreiter, Sepp and Schmidhuber, Jürgen, "Long Short-Term Memory", 1997
● Gers, F. A.; Schmidhuber, J. and Cummins, F. A. (2000), "Learning to Forget: Continual Prediction with LSTM", Neural Computation 12(10), 2451–2471