32
LSTMs Overview Subhashini Venugopalan

LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

  • Upload
    others

  • View
    2

  • Download
    0

Embed Size (px)

Citation preview

Page 1: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

LSTMs Overview

Subhashini Venugopalan

Page 2: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Neural Networks

B

zt

Input

Hidden

Hidden

Output

Page 3: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

WHY RNNs/LSTMs?

Accepts only fixed size input e.g 224x224 images.

Performs a fixed number of computations (#layers).

Outputs a fixed size vector.

B

zt

Input

Hidden

Hidden

Output

Can we operate over sequences of inputs?

Limitations of vanilla Neural Networks

Page 4: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Recurrent Neural Networks

Image Credit: Chris Olah

They are networks with loops. [Elman ‘90]

Page 5: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Un-Roll The Loop

Image Credit: Chris Olah

Recurrent Neural Network “unrolled in time”

● Each time step has a layer with the same weights.● The repeating layer/module is a sigmoid or a tanh.● Learns to model (ht | x1, …, xt-1)

Page 6: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Simple RNNs

Image Credit: Chris Olah

sigmoidor

tanh

sigmoid or tanh

Page 7: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Problems with Simple RNNs

● Can’t seem to handle “long-term dependencies” in practice● Gradients shrink through the many layers (Vanishing Gradients)

[Hochreiter ‘91][Bengio et. al. ‘94]Image Credit: Chris Olah

Page 8: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Long Short Term Memory (LSTMs)

Image Credit: Chris Olah[Hochreiter and Schmidhuber ‘97]

Page 9: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

LSTM Unit

xt

ht-1

xt ht-1 xt ht-1

xt ht-1

ht

Memory Cell

Output GateInput

Gate

Forget Gate

Input ModulationGate

+

Memory Cell:Core of the LSTM UnitEncodes all inputs observed

[Hochreiter and Schmidhuber ‘97][Graves ‘13]

Page 10: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

LSTM Unit

xt

ht-1

xt ht-1 xt ht-1

xt ht-1

ht

Memory Cell

Output GateInput

Gate

Forget Gate

Input ModulationGate

+

Memory Cell:Core of the LSTM UnitEncodes all inputs observed

Gates:Input, Output and ForgetSigmoid [0,1]

[Hochreiter and Schmidhuber ‘97][Graves ‘13]

Page 11: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

LSTM Unit

xt

ht-1

xt ht-1 xt ht-1

xt ht-1

ht

Memory Cell

Output GateInput

Gate

Forget Gate

Input ModulationGate

+

[Hochreiter and Schmidhuber ‘97][Graves ‘13]

Update the Cell state

Learns long-term dependencies

Page 12: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Can Model Sequences

● Can handle longer-term dependencies● Overcomes Vanishing Gradients problem● GRUs - Gated Recurrent Units is a much simpler variant which also

overcomes these issues.

LSTM LSTM LSTM LSTM

[Cho et. al. ‘14]

Page 13: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Putting Things Together

Image Credit: Sutskever et. al.

Encode a sequence of inputs to a vector.(ht | x1, …, xt-1)

Decode from the vector to a sequence of outputs.Pr(xt | x1, …, xt-1)

Page 14: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

SOLVE A WIDER RANGE OF PROBLEMS

Image Captioning

Image Credit: Andrej Karpathy

Activity Recognition Sequence to Sequence

Vinyals et. al. ‘15,Donahue et. al. ‘15

Donahue et. al. ‘15 Machine TranslationSpeech RecognitionVideo DescriptionVQA, POS tagging, ...

Sutskever et. al. ‘14, Cho et. al. ‘14Graves & Jaitly ‘14V. et. al. ‘15, Li et. al. ‘153 of 4 papers to be discussed this class

Page 15: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Resources● Graves’ paper - LSTMs explanation. Generating sequences with recurrent

neural networks. Applications to handwriting and speech recognition.● Chris’ Blog - LSTM unit explanation.● Karpathy’s Blog - Applications.● Tensorflow and Caffe - Code examples.

Page 16: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Sequence to Sequence Video to Text

Subhashini Venugopalan, Marcus Rohrbach, Jeff Donahue Raymond Mooney, Trevor Darrell, Kate Saenko

Page 17: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Objective

A monkey is pulling a dog’s tail and is chased by the dog.

Page 18: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Encode

Recurrent Neural Networks (RNNs) can map a vector to a sequence.

[Donahue et al. CVPR’15]

[Sutskever et al. NIPS’14]

[Vinyals et al. CVPR’15]

EnglishSentence

RNN encoder

RNN decoder

FrenchSentence

Encode RNN decoder Sentence

Encode RNN decoder Sentence [Venugopalan et. al.

NAACL’15]

RNN decoder Sentence [Venugopalan et. al. ICCV’

15] (this work)

RNN encoder

Page 19: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

S2VT Overview

LSTM LSTMLSTMLSTM LSTM LSTMLSTMLSTM

LSTM LSTMLSTMLSTM LSTM LSTMLSTMLSTM

CNN CNN CNN CNN

A man is talking

...

...Encoding stage

Decoding stage

Now decode it to a sentence!

Sequence to Sequence - Video to Text (S2VT)S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko

Page 20: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Frames: RGB

CNN

1000 categories

CNN

Forward propagateOutput: “fc7” features

(activations before classification layer)

fc7: 4096 dimension “feature vector”

1. Train on Imagenet

2. Take activations from layer before classification

Page 21: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Frames: Flow

CNN (modified AlexNet)

101 Action Classes

CNNForward propagate

Output: “fc7” features(activations before classification layer)

fc7: 4096 dimension “feature vector”

1. Train CNN on Activity classes

3. Take activations from layer before classification

2. Use optical flow to extract flow images.

UCF 101

[T. Brox et. al. ECCV ‘04]

Page 22: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Dataset: Youtube

● A man is walking on a rope.● A man is walking across a rope.● A man is balancing on a rope.● A man is balancing on a rope at the beach.● A man walks on a tightrope at the beach.● A man is balancing on a volleyball net.● A man is walking on a rope held by poles● A man balanced on a wire.● The man is balancing on the wire.● A man is walking on a rope.● A man is standing in the sea shore.

● ~2000 clips● Avg. length: 11s per clip● ~40 sentence per clip● ~81,000 sentences

Page 23: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Results (Youtube)

27.7

28.2

29.2

29.8

Mean-Pool (VGG)

S2VT (RGB+Flow)

S2VT (randomized)

S2VT (RGB)

METEOR: MT metric. Considers alignment, para-phrases and similarity.

Page 24: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM
Page 25: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Evaluation: Movie Corpora

MPII-MD

● MPII, Germany● DVS alignment: semi-

automated and crowdsourced● 94 movies● 68,000 clips● Avg. length: 3.9s per clip● ~1 sentence per clip● 68,375 sentences

M-VAD

● Univ. of Montreal● DVS alignment: automated

speech extraction● 92 movies● 46,009 clips● Avg. length: 6.2s per clip● 1-2 sentences per clip● 56,634 sentences

Page 26: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Movie Corpus - DVS

Processed: Looking troubled, someone descends the stairs.

Someone rushes into the courtyard. She then puts a head scarf on ...

Page 27: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Results (MPII-MD Movie Corpus)

5.6

6.7

7.1

Best Prior Work[Rohrbach et al. CVPR’15]

Mean-Pool

S2VT (RGB)

Page 28: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Results (M-VAD Movie Corpus)

4.3

6.1

6.7

Best Prior Work[Yao et al. ICCV’15]

Mean-Pool

S2VT (RGB)

Page 30: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

● What are the advantages/drawbacks of this approach?○ End-to-end, annotations

● Detaching recognition and generation.● Why only METEOR (not BLEU or other metrics)?● Domain adaptation, Re-use RNNs (youtube -> movies, activity recognition)● Languages other than English.● Features apart from Optical Flow, RGB; temporal representation.

Discussion

Sequence to Sequence - Video to Text (S2VT)S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko

Page 31: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM
Page 32: LSTMs Overview - University of Texas at Austinvision.cs.utexas.edu/381V-spring2016/slides/subha-lstm.pdf · Applications to handwriting and speech recognition. Chris’ Blog - LSTM

Code and more exampleshttp://vsubhashini.github.io/s2vt.html

Sequence to Sequence - Video to Text (S2VT)S. Venugopalan, M. Rohrbach, J. Donahue, R. Mooney, T. Darrell, K. Saenko