Long Short-Term Memory Network, Hien Van Nguyen, University of Houston, 11/6/2017


Page 1: Long Short-Term Memory Network (WordPress.com, 2017-11-07)

Long Short-Term Memory Network

Hien Van Nguyen

University of Houston

11/6/2017

Page 2:

Why recurrent networks?

• Sequential input: the next state depends on the previous state

• Generalizes to inputs of variable length

• Considers one small chunk at a time, so the model needs fewer parameters

11/7/2017 Machine Learning 2

Page 3:

What is a sequence?


Source: https://uvadlc.github.io/lectures/lecture8.pdf

Page 4:

One-hot vector

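As a small illustration (my own, not from the slides), a one-hot encoding maps each symbol in a vocabulary to a vector that is all zeros except for a single 1:

```python
import numpy as np

def one_hot(index, vocab_size):
    """Return a vector of zeros with a 1 at the given index."""
    v = np.zeros(vocab_size)
    v[index] = 1.0
    return v

# Example: encode the word "cat" in a 4-word vocabulary
vocab = {"the": 0, "cat": 1, "sat": 2, "down": 3}
print(one_hot(vocab["cat"], len(vocab)))  # [0. 1. 0. 0.]
```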

Page 5:

Recurrent networks


Unroll through time
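Unrolling can be sketched as a plain loop (my own sketch; the weight names W, U, b are assumptions, not the slides' notation):

```python
import numpy as np

def rnn_forward(xs, h0, W, U, b):
    """Unroll a vanilla RNN through time: h_t = tanh(W @ h_{t-1} + U @ x_t + b)."""
    h = h0
    states = []
    for x in xs:  # one step per element of the input sequence
        h = np.tanh(W @ h + U @ x + b)
        states.append(h)
    return states

rng = np.random.default_rng(0)
H, D, T = 4, 3, 5  # hidden size, input size, sequence length
W, U, b = rng.normal(size=(H, H)), rng.normal(size=(H, D)), np.zeros(H)
xs = [rng.normal(size=D) for _ in range(T)]
states = rnn_forward(xs, np.zeros(H), W, U, b)
print(len(states))  # one hidden state per timestep
```

Note that the same W, U, b are reused at every step, which is why the model handles variable-length input with a fixed number of parameters.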

Page 6:

Recurrent networks


Unroll through time

Page 7:

Simple recurrent network

• Linear activation

• Gradient:

• 𝑇 is the number of timesteps considered

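For the linear-activation case, the unrolled state is h_T = W^T h_0, so the gradient with respect to the initial state is the T-th matrix power of W. A quick check (my own sketch) of how its norm behaves:

```python
import numpy as np

def state_jacobian(W, T):
    """For the linear recurrence h_t = W @ h_{t-1}, d h_T / d h_0 = W^T (matrix power)."""
    return np.linalg.matrix_power(W, T)

W = np.array([[0.5, 0.0],
              [0.0, 0.5]])  # eigenvalues 0.5 < 1
J = state_jacobian(W, 20)
print(np.linalg.norm(J))  # tiny: the gradient vanishes as T grows
```

With eigenvalues above 1 the same power blows up instead, which is the exploding-gradient side of the problem.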

Page 8:

Problem of Vanishing/Exploding Gradient

• Review of chain rule

• Apply chain rule:


How a change in V at step k affects the loss at step t

On the difficulty of training recurrent networks https://arxiv.org/pdf/1211.5063.pdf
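A numeric illustration (my own sketch) of the chain-rule product for a tanh RNN: each step contributes a Jacobian diag(1 − h_t²) · W, and with small weights the norm of the accumulated product shrinks as we backpropagate further into the past:

```python
import numpy as np

rng = np.random.default_rng(1)
H, T = 8, 50
W = rng.normal(scale=0.1, size=(H, H))  # deliberately small weights

# Forward pass: h_t = tanh(W @ h_{t-1})
hs = [rng.normal(size=H)]
for _ in range(T):
    hs.append(np.tanh(W @ hs[-1]))

# Chain rule: d h_T / d h_k is a product of per-step Jacobians
grad = np.eye(H)
norms = []
for t in range(T, 0, -1):
    J_t = np.diag(1 - hs[t] ** 2) @ W  # Jacobian of step t
    grad = grad @ J_t
    norms.append(np.linalg.norm(grad))

print(norms[0], norms[-1])  # the norm shrinks the further back we go
```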

Page 9:

Problem of Vanishing/Exploding Gradient

• Recall that:

• Using chain rule:


Page 10:

Problem of Vanishing/Exploding Gradient


Page 11:

Long Short-Term Memory Networks (LSTM)

• Idea: don't multiply. Multiplication == vanishing gradients.

Instead of multiplying the previous hidden state by a matrix to get the new state,

we add something to the old hidden state to get the new state (called the "cell" rather than the "hidden state" in LSTM terminology, explained next).

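The contrast can be sketched as follows (my own simplified code; in a real LSTM the added candidate is gated and computed from the hidden state, as the next slides show):

```python
import numpy as np

def rnn_step(h, x, W, U):
    """Vanilla RNN: the new state comes from multiplying the old one by W."""
    return np.tanh(W @ h + U @ x)

def additive_step(c, x, W, U):
    """LSTM-style idea: add a candidate to the old cell state."""
    return c + np.tanh(W @ c + U @ x)  # gradient flows through the identity path

c = np.array([1.0, -2.0])
x = np.array([0.5])
# With zero weights the candidate is zero and the state passes through unchanged:
print(additive_step(c, x, np.zeros((2, 2)), np.zeros((2, 1))))  # [ 1. -2.]
```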

Page 12:

Long Short-Term Memory Networks (LSTM)

• Intuition:
  • Not everything is useful to remember
  • Not every input is useful to take in
  • It is not necessary to produce an output at every step

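These three intuitions map onto the LSTM's forget, input, and output gates, each a sigmoid with values in (0, 1). A hedged numeric sketch (gate values chosen by hand purely for illustration):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Each intuition corresponds to a gate:
forget = sigmoid(np.array([4.0, -4.0]))   # keep the 1st memory, drop the 2nd
inp    = sigmoid(np.array([-4.0, 4.0]))   # ignore the 1st input, take the 2nd
out    = sigmoid(np.array([-4.0, -4.0]))  # emit almost nothing this step

c_old = np.array([1.0, 1.0])   # old cell state
cand  = np.array([0.5, 0.5])   # candidate values from the current input
c_new = forget * c_old + inp * cand  # gated additive update
h     = out * np.tanh(c_new)         # gated output
print(c_new, h)
```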

Page 13:

Long Short-Term Memory Networks (LSTM)

• Comparison of vanilla RNN and LSTM


Vanilla RNN

LSTM

Source: http://colah.github.io/posts/2015-08-Understanding-LSTMs/
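For reference, the updates being compared, in the standard formulation (the symbols are the conventional ones, not necessarily the slide's):

```latex
% Vanilla RNN: one multiplicative update of the hidden state
h_t = \tanh(W_h h_{t-1} + W_x x_t + b)

% LSTM: gated, additive update of the cell state
f_t = \sigma(W_f [h_{t-1}, x_t] + b_f)         % forget gate
i_t = \sigma(W_i [h_{t-1}, x_t] + b_i)         % input gate
\tilde{c}_t = \tanh(W_c [h_{t-1}, x_t] + b_c)  % candidate values
c_t = f_t \odot c_{t-1} + i_t \odot \tilde{c}_t
o_t = \sigma(W_o [h_{t-1}, x_t] + b_o)         % output gate
h_t = o_t \odot \tanh(c_t)
```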

Page 14:

Long Short-Term Memory Networks (LSTM)

• Comparison of vanilla RNN and LSTM


Vanilla RNN

LSTM

Page 15:

LSTM-Step by Step


Page 16:

LSTM-Step by Step


Page 17:

LSTM-Step by Step


Page 18:

LSTM-Step by Step


Page 19:

LSTM-Step by Step

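The step-by-step pieces above can be collected into one cell function. This is a sketch of the standard LSTM equations; the weight names and the concatenation of [h_prev, x] are my own conventions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x, h_prev, c_prev, Wf, Wi, Wc, Wo, bf, bi, bc, bo):
    """One LSTM step over the concatenated vector [h_prev, x]."""
    z = np.concatenate([h_prev, x])
    f = sigmoid(Wf @ z + bf)         # forget gate: what to keep from c_prev
    i = sigmoid(Wi @ z + bi)         # input gate: what to take from the candidate
    c_tilde = np.tanh(Wc @ z + bc)   # candidate cell values
    c = f * c_prev + i * c_tilde     # additive cell update
    o = sigmoid(Wo @ z + bo)         # output gate: what to expose
    h = o * np.tanh(c)               # new hidden state
    return h, c

rng = np.random.default_rng(0)
H, D = 4, 3  # hidden size, input size
mk = lambda: rng.normal(size=(H, H + D))
Wf, Wi, Wc, Wo = mk(), mk(), mk(), mk()
bf = bi = bc = bo = np.zeros(H)
h, c = lstm_step(rng.normal(size=D), np.zeros(H), np.zeros(H),
                 Wf, Wi, Wc, Wo, bf, bi, bc, bo)
print(h.shape, c.shape)  # (4,) (4,)
```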

Page 20:

Long Short-Term Memory Networks (LSTM)

• Comparison of vanilla RNN and LSTM


Vanilla RNN

LSTM

Page 21:

LSTM-Gradient Flow


Learning sequence representation: https://d-nb.info/1082034037/34

http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf

Page 22:

LSTM-Gradient Flow

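The usual informal argument, sketched numerically (my own example): along the cell path c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t, the Jacobian ∂c_t/∂c_{t-1} is diag(f_t), ignoring the gates' own dependence on the state. So the backpropagated factor per step is the forget gate itself, not a fixed weight matrix:

```python
# Gradient along the cell path is a product of forget-gate values.
T = 100
f_near_one = 0.99  # forget gate close to 1: remember long-term
f_small = 0.5      # repeated small factors, as in a vanilla RNN

grad_lstm = f_near_one ** T   # survives over 100 steps
grad_rnn_like = f_small ** T  # vanishes over 100 steps
print(grad_lstm, grad_rnn_like)
```

When the network learns to keep the forget gate near 1 for relevant memories, the gradient along the cell "highway" neither vanishes nor explodes.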

Page 23:

Applications – Machine Translation


Source: https://uvadlc.github.io/lectures/lecture8.pdf

Page 24:

Applications – Machine Translation


Google Pixel Buds

Page 25:

Applications – Image Captioning


Page 26:

Applications – Question Answering


Page 27:

Applications – Visual Question Answering


Source: http://cs231n.stanford.edu/slides/2017/cs231n_2017_lecture10.pdf

Page 28:

Applications – Visual Question Answering
